[PATCH 0/8] drm/gem: Audit around handle

dri-devel.lists.freedesktop.org archive mirror

* [PATCH 0/8] drm/gem: Audit around handle_create races
@ 2025-05-28  9:12 Simona Vetter
  2025-05-28  9:12 ` [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail() Simona Vetter
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:12 UTC (permalink / raw)
  To: DRI Development; +Cc: intel-xe, Simona Vetter

Hi all,

Thanks to a report by Jacek Lawrynowicz I've crawled around in core and
driver code around drm_gem_handle_create() and found a bunch of issues.

The attached series consists of either fixes where I could do them, or
RFC-style patches that just add a comment about what looks wrong. The
conversion from idr_for_each_entry to idr_for_each only fixes temporary,
premature termination of idr iteration, so its impact is fairly benign.

Testing and review very much welcome.

Cheers, Sima

Simona Vetter (8):
  drm/gem: Fix race in drm_gem_handle_create_tail()
  drm/fdinfo: Switch to idr_for_each() in drm_show_memory_stats()
  drm/panthor: Fix UAF in panthor_gem_create_with_handle() debugfs code
  accel/qaic: delete qaic_bo.handle
  drm/amd/kfd: Add comment about possible drm_gem_handle_create() race
  drm/amdgpu: Add comments about drm_file.object_idr issues
  drm/vmwgfx: Add comments about drm_file.object_idr issues
  drm/xe: Add comments about drm_file.object_idr issues

 drivers/accel/qaic/qaic.h                     |  2 -
 drivers/accel/qaic/qaic_data.c                |  1 -
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c       |  2 +
 drivers/gpu/drm/drm_file.c                    | 95 +++++++++++--------
 drivers/gpu/drm/drm_gem.c                     | 10 +-
 drivers/gpu/drm/panthor/panthor_gem.c         | 31 +++---
 drivers/gpu/drm/panthor/panthor_gem.h         |  3 -
 drivers/gpu/drm/vmwgfx/vmwgfx_gem.c           |  1 +
 drivers/gpu/drm/xe/xe_drm_client.c            |  3 +
 include/drm/drm_file.h                        |  3 +
 11 files changed, 90 insertions(+), 63 deletions(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail()
  2025-05-28  9:12 [PATCH 0/8] drm/gem: Audit around handle_create races Simona Vetter
@ 2025-05-28  9:12 ` Simona Vetter
  2025-05-28  9:26   ` Simona Vetter
                     ` (2 more replies)
  2025-05-28  9:13 ` [PATCH 2/8] drm/fdinfo: Switch to idr_for_each() in drm_show_memory_stats() Simona Vetter
                   ` (6 subsequent siblings)
  7 siblings, 3 replies; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:12 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Jacek Lawrynowicz, stable,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Simona Vetter

Object creation is a careful dance where we must guarantee that the
object is fully constructed before it is visible to other threads, and
GEM buffer objects are no different.

Final publishing happens by calling drm_gem_handle_create(). After
that the only allowed thing to do is call drm_gem_object_put() because
a concurrent call to the GEM_CLOSE ioctl with a correctly guessed id
(which is trivial since we have a linear allocator) can already tear
down the object again.
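
Here is a minimal userspace sketch of that ordering. It makes some simplifying assumptions: a single-threaded toy refcount stands in for the GEM kref, and a fixed array stands in for drm_file.object_idr. `struct toy_obj`, `toy_handle_create()` and `toy_create_with_handle()` are hypothetical names, not DRM API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy object and per-file handle table, not DRM API. */
struct toy_obj {
	int refcount;
	size_t size;
	unsigned int flags;
};

#define TOY_MAX_HANDLES 16

struct toy_table {
	/* Slot 0 stays unused so that 0 is never a valid handle. */
	struct toy_obj *slots[TOY_MAX_HANDLES];
};

static void toy_put(struct toy_obj *obj)
{
	assert(obj->refcount > 0);
	if (--obj->refcount == 0)
		free(obj);
}

/* Publishing step: the table takes its own reference. In a real
 * multithreaded setting another thread could now guess the handle and
 * close it, so the caller must not touch obj afterwards except to drop
 * its own reference. */
static int toy_handle_create(struct toy_table *t, struct toy_obj *obj)
{
	for (int id = 1; id < TOY_MAX_HANDLES; id++) {
		if (!t->slots[id]) {
			obj->refcount++;
			t->slots[id] = obj;
			return id;
		}
	}
	return -1;
}

static int toy_create_with_handle(struct toy_table *t, size_t size,
				  unsigned int flags)
{
	struct toy_obj *obj = calloc(1, sizeof(*obj));
	int handle;

	if (!obj)
		return -1;
	obj->refcount = 1;	/* creation reference */

	/* 1. Finish all initialization while the object is still private. */
	obj->size = size;
	obj->flags = flags;

	/* 2. Publish it. */
	handle = toy_handle_create(t, obj);

	/* 3. The only thing left to do: drop the creation reference. On
	 * failure this frees the object, on success the table keeps it. */
	toy_put(obj);
	return handle;
}
```

The point is the order of steps 1 to 3. Once step 2 has run, the creator owns nothing but its own reference.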

Luckily most drivers get this right; for the very few exceptions I've
pinged the relevant maintainers. Unfortunately we also need
drm_gem_handle_create() when creating additional handles for an
already existing object (e.g. the GETFB ioctl or the various bo import
ioctls), and hence we cannot have a drm_gem_handle_create_and_put() as
the only exported function to stop these issues from happening.

Now unfortunately the implementation of drm_gem_handle_create() isn't
living up to these standards: it does correctly finish object
initialization at the global level, and hence is safe against a
concurrent tear down. But it also sets up the file-private aspects of
the handle, and that part goes wrong: we fully register the object in
the drm_file.object_idr before calling drm_vma_node_allow() or
obj->funcs->open, which opens up races against concurrent removal of
that handle in drm_gem_handle_delete().

Fix this with the usual two-stage approach of first reserving the
handle id, and then only registering the object after we've completed
the file-private setup.
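
The two-stage idea can be sketched in userspace C. This is an illustration under assumptions, not the kernel code: a pthread mutex stands in for table_lock, a fixed array stands in for the idr, and `handle_reserve()`, `handle_publish()`, `handle_lookup()` and `handle_delete()` are hypothetical. Reserving stores NULL, loosely like the `idr_alloc(..., NULL, ...)` call in the patch. Publishing then swaps the pointer in, like the later `idr_replace()`. Deletion runs in reverse: it unpublishes first and frees the id last.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HANDLES 16

struct handle_table {
	pthread_mutex_t lock;		/* stands in for drm_file.table_lock */
	bool reserved[MAX_HANDLES];	/* id is allocated, maybe still NULL */
	void *ptr[MAX_HANDLES];		/* NULL until published */
};

/* Stage 1: allocate an id without making any object visible. */
static int handle_reserve(struct handle_table *t)
{
	int id = -1;

	pthread_mutex_lock(&t->lock);
	for (int i = 1; i < MAX_HANDLES; i++) {
		if (!t->reserved[i]) {
			t->reserved[i] = true;
			t->ptr[i] = NULL;
			id = i;
			break;
		}
	}
	pthread_mutex_unlock(&t->lock);
	return id;
}

/* Stage 2: after the file-private setup is done, publish the object. */
static void handle_publish(struct handle_table *t, int id, void *obj)
{
	pthread_mutex_lock(&t->lock);
	assert(t->reserved[id] && t->ptr[id] == NULL);
	t->ptr[id] = obj;
	pthread_mutex_unlock(&t->lock);
}

/* A reserved-but-NULL id looks exactly like a missing handle. */
static void *handle_lookup(struct handle_table *t, int id)
{
	void *obj = NULL;

	pthread_mutex_lock(&t->lock);
	if (id > 0 && id < MAX_HANDLES)
		obj = t->ptr[id];
	pthread_mutex_unlock(&t->lock);
	return obj;
}

/* Delete: unpublish, tear down, then free the id. Unpublished ids are
 * rejected, so half-constructed state is never torn down underneath
 * handle_publish(). */
static int handle_delete(struct handle_table *t, int id)
{
	void *obj = NULL;

	pthread_mutex_lock(&t->lock);
	if (id > 0 && id < MAX_HANDLES) {
		obj = t->ptr[id];
		t->ptr[id] = NULL;
	}
	pthread_mutex_unlock(&t->lock);
	if (!obj)
		return -1;

	/* Per-file teardown would run here, outside the lock. */

	pthread_mutex_lock(&t->lock);
	t->reserved[id] = false;
	pthread_mutex_unlock(&t->lock);
	return 0;
}
```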

Jacek reported this with a testcase of concurrently calling GEM_CLOSE
on a freshly-created object (which also destroys the object), but it
should be possible to hit this with just additional handles created
through import or GETFB, without completely destroying the underlying
object with the concurrent GEM_CLOSE ioctl calls.

Note that the close-side of this race was fixed in f6cd7daecff5 ("drm:
Release driver references to handle before making it available
again"), which means a cool 9 years have passed until someone noticed
that we need to make this symmetric or there are still gaps left :-/
Without the 2-stage close approach we'd still have a race, therefore
that's an integral part of this bugfix.

More importantly, this means we can have NULL pointers behind
allocated ids in our drm_file.object_idr. We need to check for that
now:

- drm_gem_handle_delete() checks for ERR_OR_NULL already

- drm_gem.c:object_lookup() also checks for NULL

- drm_gem_release() should never be called if there's another thread
  still around that could call into an IOCTL that creates a new
  handle, so it cannot race. For paranoia I added a NULL check to
  drm_gem_object_release_handle() though.

- most drivers (etnaviv, i915, msm) are fine because they use
  idr_find, which maps both ENOENT and NULL to NULL.

- vmwgfx is already broken in vmw_debugfs_gem_info_show() because NULL
  pointers might exist due to drm_gem_handle_delete(), and
  idr_for_each_entry terminates on the first NULL entry and so might not
  iterate over everything. This needs a separate patch.

- similar for amd in amdgpu_debugfs_gem_info_show() and
  amdgpu_gem_force_release(). The latter is really questionable though
  since it's a best effort hack and there's no way to close all the
  races. Needs separate patches.

- xe is really broken because it not only uses idr_for_each_entry() but
  also drops the drm_file.table_lock, which can wreck the idr iterator
  state if you're unlucky enough. Maybe another reason to look into
  the drm fdinfo memory stats instead of hand-rolling too much.

- drm_show_memory_stats() is also broken since it uses
  idr_for_each_entry. But since that's a preexisting bug I'll follow
  up with a separate patch.

Reported-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Cc: stable@vger.kernel.org
Cc: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Signed-off-by: Simona Vetter <simona.vetter@intel.com>
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
---
 drivers/gpu/drm/drm_gem.c | 10 +++++++++-
 include/drm/drm_file.h    |  3 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 1e659d2660f7..e4e20dda47b1 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -279,6 +279,9 @@ drm_gem_object_release_handle(int id, void *ptr, void *data)
 	struct drm_file *file_priv = data;
 	struct drm_gem_object *obj = ptr;

+	if (WARN_ON(!data))
+		return 0;
+
 	if (obj->funcs->close)
 		obj->funcs->close(obj, file_priv);

@@ -399,7 +402,7 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
 	idr_preload(GFP_KERNEL);
 	spin_lock(&file_priv->table_lock);

-	ret = idr_alloc(&file_priv->object_idr, obj, 1, 0, GFP_NOWAIT);
+	ret = idr_alloc(&file_priv->object_idr, NULL, 1, 0, GFP_NOWAIT);

 	spin_unlock(&file_priv->table_lock);
 	idr_preload_end();
@@ -420,6 +423,11 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
 			goto err_revoke;
 	}

+	/* mirrors drm_gem_handle_delete to avoid races */
+	spin_lock(&file_priv->table_lock);
+	obj = idr_replace(&file_priv->object_idr, obj, handle);
+	WARN_ON(obj != NULL);
+	spin_unlock(&file_priv->table_lock);
 	*handlep = handle;
 	return 0;

diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index 5c3b2aa3e69d..d344d41e6cfe 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -300,6 +300,9 @@ struct drm_file {
 	 *
 	 * Mapping of mm object handles to object pointers. Used by the GEM
 	 * subsystem. Protected by @table_lock.
+	 *
+	 * Note that allocated entries might be NULL as a transient state when
+	 * creating or deleting a handle.
 	 */
 	struct idr object_idr;

-- 
2.49.0

* [PATCH 2/8] drm/fdinfo: Switch to idr_for_each() in drm_show_memory_stats()
  2025-05-28  9:12 [PATCH 0/8] drm/gem: Audit around handle_create races Simona Vetter
  2025-05-28  9:12 ` [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail() Simona Vetter
@ 2025-05-28  9:13 ` Simona Vetter
  2025-05-28  9:22   ` Simona Vetter
  2025-05-28 20:10   ` kernel test robot
  2025-05-28  9:13 ` [PATCH 3/8] drm/panthor: Fix UAF in panthor_gem_create_with_handle() debugfs code Simona Vetter
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:13 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Rob Clark, Emil Velikov, Tvrtko Ursulin,
	stable, Simona Vetter

Unlike idr_for_each_entry(), which terminates on the first NULL entry,
idr_for_each passes them through. This fixes potential issues with the
idr walk terminating prematurely due to transient NULL entries the
that exist when creating and destroying a handle.

Note that transient NULL pointers in drm_file.object_idr have been a
thing since f6cd7daecff5 ("drm: Release driver references to handle
before making it available again"), this is a really old issue.

Aside from temporarily inconsistent fdinfo statistics there's no other
impact of this issue.
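
Here is a sketch of a callback-style walk that passes NULL slots through, assuming a plain array in place of the idr. `table_for_each()` only loosely models the idr_for_each() contract: the callback receives the id, the pointer and an opaque `data` for each entry, and a non-zero return stops the walk. It and the other names are hypothetical. The callback takes its accumulator from `data`, much like drm_bo_print_data, and skips NULL entries itself.

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 8

struct buf {
	size_t size;
	int shared;
};

/* Accumulator handed to the callback through the opaque data pointer. */
struct mem_stats {
	size_t shared;
	size_t priv;
};

typedef int (*slot_fn)(int id, void *ptr, void *data);

/* Visit every allocated id, passing NULL pointers through to fn.
 * A non-zero return from fn stops the walk and is propagated. */
static int table_for_each(void *const slots[], const int allocated[], int n,
			  slot_fn fn, void *data)
{
	for (int id = 0; id < n; id++) {
		int ret;

		if (!allocated[id])
			continue;
		ret = fn(id, slots[id], data);
		if (ret)
			return ret;
	}
	return 0;
}

static int count_buf(int id, void *ptr, void *data)
{
	struct mem_stats *st = data;	/* state comes from data */
	struct buf *b = ptr;

	(void)id;
	assert(st != NULL);
	if (!b)		/* reserved-but-unpublished or half-deleted entry */
		return 0;
	if (b->shared)
		st->shared += b->size;
	else
		st->priv += b->size;
	return 0;
}
```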

Fixes: 686b21b5f6ca ("drm: Add fdinfo memory stats")
Cc: Rob Clark <robdclark@chromium.org>
Cc: Emil Velikov <emil.l.velikov@gmail.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: <stable@vger.kernel.org> # v6.5+
Signed-off-by: Simona Vetter <simona.vetter@intel.com>
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
---
 drivers/gpu/drm/drm_file.c | 95 ++++++++++++++++++++++----------------
 1 file changed, 55 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index 246cf845e2c9..428a4eb85e94 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -892,6 +892,58 @@ void drm_print_memory_stats(struct drm_printer *p,
 }
 EXPORT_SYMBOL(drm_print_memory_stats);
 
+struct drm_bo_print_data {
+	struct drm_memory_stats status;
+	enum drm_gem_object_status supported_status;
+};
+
+static int
+drm_bo_memory_stats(int id, void *ptr, void *data)
+{
+	struct drm_bo_print_data *drm_data;
+	struct drm_gem_object *obj = ptr;
+	enum drm_gem_object_status s = 0;
+	size_t add_size;
+
+	if (!obj)
+		return 0;
+
+	add_size = (obj->funcs && obj->funcs->rss) ?
+		obj->funcs->rss(obj) : obj->size;
+
+	if (obj->funcs && obj->funcs->status) {
+		s = obj->funcs->status(obj);
+		drm_data->supported_status |= s;
+	}
+
+	if (drm_gem_object_is_shared_for_memory_stats(obj))
+		drm_data->status.shared += obj->size;
+	else
+		drm_data->status.private += obj->size;
+
+	if (s & DRM_GEM_OBJECT_RESIDENT) {
+		drm_data->status.resident += add_size;
+	} else {
+		/* If already purged or not yet backed by pages, don't
+		 * count it as purgeable:
+		 */
+		s &= ~DRM_GEM_OBJECT_PURGEABLE;
+	}
+
+	if (!dma_resv_test_signaled(obj->resv, dma_resv_usage_rw(true))) {
+		drm_data->status.active += add_size;
+		drm_data->supported_status |= DRM_GEM_OBJECT_ACTIVE;
+
+		/* If still active, don't count as purgeable: */
+		s &= ~DRM_GEM_OBJECT_PURGEABLE;
+	}
+
+	if (s & DRM_GEM_OBJECT_PURGEABLE)
+		drm_data->status.purgeable += add_size;
+
+	return 0;
+}
+
 /**
  * drm_show_memory_stats - Helper to collect and show standard fdinfo memory stats
  * @p: the printer to print output to
@@ -902,50 +954,13 @@ EXPORT_SYMBOL(drm_print_memory_stats);
  */
 void drm_show_memory_stats(struct drm_printer *p, struct drm_file *file)
 {
-	struct drm_gem_object *obj;
-	struct drm_memory_stats status = {};
-	enum drm_gem_object_status supported_status = 0;
-	int id;
+	struct drm_bo_print_data data = {};
 
 	spin_lock(&file->table_lock);
-	idr_for_each_entry (&file->object_idr, obj, id) {
-		enum drm_gem_object_status s = 0;
-		size_t add_size = (obj->funcs && obj->funcs->rss) ?
-			obj->funcs->rss(obj) : obj->size;
-
-		if (obj->funcs && obj->funcs->status) {
-			s = obj->funcs->status(obj);
-			supported_status |= s;
-		}
-
-		if (drm_gem_object_is_shared_for_memory_stats(obj))
-			status.shared += obj->size;
-		else
-			status.private += obj->size;
-
-		if (s & DRM_GEM_OBJECT_RESIDENT) {
-			status.resident += add_size;
-		} else {
-			/* If already purged or not yet backed by pages, don't
-			 * count it as purgeable:
-			 */
-			s &= ~DRM_GEM_OBJECT_PURGEABLE;
-		}
-
-		if (!dma_resv_test_signaled(obj->resv, dma_resv_usage_rw(true))) {
-			status.active += add_size;
-			supported_status |= DRM_GEM_OBJECT_ACTIVE;
-
-			/* If still active, don't count as purgeable: */
-			s &= ~DRM_GEM_OBJECT_PURGEABLE;
-		}
-
-		if (s & DRM_GEM_OBJECT_PURGEABLE)
-			status.purgeable += add_size;
-	}
+	idr_for_each(&file->object_idr, &drm_bo_memory_stats, &data);
 	spin_unlock(&file->table_lock);
 
-	drm_print_memory_stats(p, &status, supported_status, "memory");
+	drm_print_memory_stats(p, &data.status, data.supported_status, "memory");
 }
 EXPORT_SYMBOL(drm_show_memory_stats);
 
-- 
2.49.0


* [PATCH 3/8] drm/panthor: Fix UAF in panthor_gem_create_with_handle() debugfs code
  2025-05-28  9:12 [PATCH 0/8] drm/gem: Audit around handle_create races Simona Vetter
  2025-05-28  9:12 ` [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail() Simona Vetter
  2025-05-28  9:13 ` [PATCH 2/8] drm/fdinfo: Switch to idr_for_each() in drm_show_memory_stats() Simona Vetter
@ 2025-05-28  9:13 ` Simona Vetter
  2025-05-29 12:31   ` kernel test robot
  2025-06-01 14:06   ` Adrián Larumbe
  2025-05-28  9:13 ` [PATCH 4/8] accel/qaic: delete qaic_bo.handle Simona Vetter
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:13 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Adrián Larumbe, Boris Brezillon,
	Steven Price, Liviu Dudau, Simona Vetter

The object is potentially already gone after the drm_gem_object_put().
In general the object should be fully constructed before calling
drm_gem_handle_create(), except the debugfs tracking uses a separate
lock and list and a separate flag to denote whether the object is
actually initialized.

Since I'm touching all this anyway, simplify it by only adding the
object to debugfs when it's ready for that, which allows us to
delete that separate flag. panthor_gem_debugfs_bo_rm() already checks
whether we've actually been added to the list or this is some error
path cleanup.
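
The resulting lifecycle can be sketched with a hand-rolled list modelled on the kernel's list_head. Locking (ptdev->gems.lock in panthor) is left out, and `struct toy_bo` and the `toy_bo_*()` helpers are hypothetical. The removal check assumes that panthor_gem_debugfs_bo_rm() tests list membership in a similar way.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly-linked list, modelled on list_head. */
struct node {
	struct node *prev, *next;
};

static void node_init(struct node *n) { n->prev = n->next = n; }
static bool node_unlinked(const struct node *n) { return n->next == n; }

static void node_add_tail(struct node *n, struct node *head)
{
	assert(node_unlinked(n));	/* adding twice would corrupt the list */
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void node_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

struct toy_bo {
	unsigned int flags;
	struct node debugfs_node;
};

/* At allocation: only make the node self-linked, do not list it yet. */
static void toy_bo_init(struct toy_bo *bo) { node_init(&bo->debugfs_node); }

/* Once fully set up: record the flags, then make the BO visible. */
static void toy_bo_debugfs_ready(struct toy_bo *bo, struct node *all,
				 unsigned int flags)
{
	bo->flags = flags;
	node_add_tail(&bo->debugfs_node, all);
}

/* Safe both for listed BOs and for error-path BOs never added. */
static void toy_bo_debugfs_rm(struct toy_bo *bo)
{
	if (!node_unlinked(&bo->debugfs_node))
		node_del_init(&bo->debugfs_node);
}

static int count_listed(const struct node *head)
{
	int n = 0;

	for (const struct node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

Because everything on the list is fully set up by construction, the printing side needs no "initialized" flag.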

Fixes: a3707f53eb3f ("drm/panthor: show device-wide list of DRM GEM objects over DebugFS")
Cc: Adrián Larumbe <adrian.larumbe@collabora.com>
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Signed-off-by: Simona Vetter <simona.vetter@intel.com>
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
---
 drivers/gpu/drm/panthor/panthor_gem.c | 31 +++++++++++++--------------
 drivers/gpu/drm/panthor/panthor_gem.h |  3 ---
 2 files changed, 15 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
index 7c00fd77758b..f334444cb5df 100644
--- a/drivers/gpu/drm/panthor/panthor_gem.c
+++ b/drivers/gpu/drm/panthor/panthor_gem.c
@@ -16,9 +16,11 @@
 #include "panthor_mmu.h"
 
 #ifdef CONFIG_DEBUG_FS
-static void panthor_gem_debugfs_bo_add(struct panthor_device *ptdev,
-				       struct panthor_gem_object *bo)
+static void panthor_gem_debugfs_bo_add(struct panthor_gem_object *bo)
 {
+	struct panthor_device *ptdev = container_of(bo->base.base.dev,
+						    struct panthor_device, base);
+
 	INIT_LIST_HEAD(&bo->debugfs.node);
 
 	bo->debugfs.creator.tgid = current->group_leader->pid;
@@ -44,12 +46,10 @@ static void panthor_gem_debugfs_bo_rm(struct panthor_gem_object *bo)
 
 static void panthor_gem_debugfs_set_usage_flags(struct panthor_gem_object *bo, u32 usage_flags)
 {
-	bo->debugfs.flags = usage_flags | PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED;
+	bo->debugfs.flags = usage_flags;
+	panthor_gem_debugfs_bo_add(bo);
 }
 #else
-static void panthor_gem_debugfs_bo_add(struct panthor_device *ptdev,
-				       struct panthor_gem_object *bo)
-{}
 static void panthor_gem_debugfs_bo_rm(struct panthor_gem_object *bo) {}
 static void panthor_gem_debugfs_set_usage_flags(struct panthor_gem_object *bo, u32 usage_flags) {}
 #endif
@@ -246,7 +246,7 @@ struct drm_gem_object *panthor_gem_create_object(struct drm_device *ddev, size_t
 	drm_gem_gpuva_set_lock(&obj->base.base, &obj->gpuva_list_lock);
 	mutex_init(&obj->label.lock);
 
-	panthor_gem_debugfs_bo_add(ptdev, obj);
+	INIT_LIST_HEAD(&obj->debugfs.node);
 
 	return &obj->base.base;
 }
@@ -285,6 +285,12 @@ panthor_gem_create_with_handle(struct drm_file *file,
 		bo->base.base.resv = bo->exclusive_vm_root_gem->resv;
 	}
 
+	/*
+	 * No explicit flags are needed in the call below, since the
+	 * function internally sets the INITIALIZED bit for us.
+	 */
+	panthor_gem_debugfs_set_usage_flags(bo, 0);
+
 	/*
 	 * Allocate an id of idr table where the obj is registered
 	 * and handle has the id what user can see.
@@ -296,12 +302,6 @@ panthor_gem_create_with_handle(struct drm_file *file,
 	/* drop reference from allocate - handle holds it now. */
 	drm_gem_object_put(&shmem->base);
 
-	/*
-	 * No explicit flags are needed in the call below, since the
-	 * function internally sets the INITIALIZED bit for us.
-	 */
-	panthor_gem_debugfs_set_usage_flags(bo, 0);
-
 	return ret;
 }
 
@@ -387,7 +387,7 @@ static void panthor_gem_debugfs_bo_print(struct panthor_gem_object *bo,
 	unsigned int refcount = kref_read(&bo->base.base.refcount);
 	char creator_info[32] = {};
 	size_t resident_size;
-	u32 gem_usage_flags = bo->debugfs.flags & (u32)~PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED;
+	u32 gem_usage_flags = bo->debugfs.flags;
 	u32 gem_state_flags = 0;
 
 	/* Skip BOs being destroyed. */
@@ -436,8 +436,7 @@ void panthor_gem_debugfs_print_bos(struct panthor_device *ptdev,
 
 	scoped_guard(mutex, &ptdev->gems.lock) {
 		list_for_each_entry(bo, &ptdev->gems.node, debugfs.node) {
-			if (bo->debugfs.flags & PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED)
-				panthor_gem_debugfs_bo_print(bo, m, &totals);
+			panthor_gem_debugfs_bo_print(bo, m, &totals);
 		}
 	}
 
diff --git a/drivers/gpu/drm/panthor/panthor_gem.h b/drivers/gpu/drm/panthor/panthor_gem.h
index 4dd732dcd59f..8fc7215e9b90 100644
--- a/drivers/gpu/drm/panthor/panthor_gem.h
+++ b/drivers/gpu/drm/panthor/panthor_gem.h
@@ -35,9 +35,6 @@ enum panthor_debugfs_gem_usage_flags {
 
 	/** @PANTHOR_DEBUGFS_GEM_USAGE_FLAG_FW_MAPPED: BO is mapped on the FW VM. */
 	PANTHOR_DEBUGFS_GEM_USAGE_FLAG_FW_MAPPED = BIT(PANTHOR_DEBUGFS_GEM_USAGE_FW_MAPPED_BIT),
-
-	/** @PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED: BO is ready for DebugFS display. */
-	PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED = BIT(31),
 };
 
 /**
-- 
2.49.0


* [PATCH 4/8] accel/qaic: delete qaic_bo.handle
  2025-05-28  9:12 [PATCH 0/8] drm/gem: Audit around handle_create races Simona Vetter
                   ` (2 preceding siblings ...)
  2025-05-28  9:13 ` [PATCH 3/8] drm/panthor: Fix UAF in panthor_gem_create_with_handle() debugfs code Simona Vetter
@ 2025-05-28  9:13 ` Simona Vetter
  2025-05-28 15:15   ` Jeff Hugo
  2025-06-06 16:25   ` Jeff Hugo
  2025-05-28  9:13 ` [PATCH 5/8] drm/amd/kfd: Add comment about possible drm_gem_handle_create() race Simona Vetter
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:13 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Jeff Hugo, Carl Vanderlip, linux-arm-msm,
	Simona Vetter

Handles are per-file, not global, so this makes no sense. Plus it's
set only after calling drm_gem_handle_create(), and drivers are not
allowed to further initialize a bo after that function has published it
already.

It is also entirely unused, which helps enormously with removing it
:-)

Since we're still holding a reference to the bo nothing bad can
happen, hence not cc: stable material.
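
A small hypothetical model shows why a single handle member cannot be meaningful. The same object can be reachable through several files, each with its own handle namespace, so it legitimately has one handle per file. All names below are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLES 8

struct toy_obj {
	int refcount;
	/* No "handle" member: there is no single value it could hold. */
};

/* Each open file has its own, independent handle namespace. */
struct toy_file {
	struct toy_obj *slots[MAX_HANDLES];
};

static int toy_handle_create(struct toy_file *f, struct toy_obj *obj)
{
	assert(obj != NULL);
	for (int id = 1; id < MAX_HANDLES; id++) {
		if (!f->slots[id]) {
			f->slots[id] = obj;
			obj->refcount++;
			return id;
		}
	}
	return -1;
}
```

In this model, the same object ends up as handle 2 in one file and handle 1 in another.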

Cc: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
Cc: Carl Vanderlip <quic_carlv@quicinc.com>
Cc: linux-arm-msm@vger.kernel.org
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
Signed-off-by: Simona Vetter <simona.vetter@intel.com>
---
 drivers/accel/qaic/qaic.h      | 2 --
 drivers/accel/qaic/qaic_data.c | 1 -
 2 files changed, 3 deletions(-)

diff --git a/drivers/accel/qaic/qaic.h b/drivers/accel/qaic/qaic.h
index 0dbb8e32e4b9..7817ce18b8f2 100644
--- a/drivers/accel/qaic/qaic.h
+++ b/drivers/accel/qaic/qaic.h
@@ -213,8 +213,6 @@ struct qaic_bo {
 	bool			sliced;
 	/* Request ID of this BO if it is queued for execution */
 	u16			req_id;
-	/* Handle assigned to this BO */
-	u32			handle;
 	/* Wait on this for completion of DMA transfer of this BO */
 	struct completion	xfer_done;
 	/*
diff --git a/drivers/accel/qaic/qaic_data.c b/drivers/accel/qaic/qaic_data.c
index 1bce1af7c72c..797289e9d780 100644
--- a/drivers/accel/qaic/qaic_data.c
+++ b/drivers/accel/qaic/qaic_data.c
@@ -731,7 +731,6 @@ int qaic_create_bo_ioctl(struct drm_device *dev, void *data, struct drm_file *fi
 	if (ret)
 		goto free_bo;
 
-	bo->handle = args->handle;
 	drm_gem_object_put(obj);
 	srcu_read_unlock(&qdev->dev_lock, qdev_rcu_id);
 	srcu_read_unlock(&usr->qddev_lock, usr_rcu_id);
-- 
2.49.0


* [PATCH 5/8] drm/amd/kfd: Add comment about possible drm_gem_handle_create() race
  2025-05-28  9:12 [PATCH 0/8] drm/gem: Audit around handle_create races Simona Vetter
                   ` (3 preceding siblings ...)
  2025-05-28  9:13 ` [PATCH 4/8] accel/qaic: delete qaic_bo.handle Simona Vetter
@ 2025-05-28  9:13 ` Simona Vetter
  2025-05-28  9:13 ` [PATCH 6/8] drm/amdgpu: Add comments about drm_file.object_idr issues Simona Vetter
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:13 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Felix Kuehling, amd-gfx, Simona Vetter

I've long ago stopped trying to fully understand all the locking in
amdkfd, so maybe this is safe for a contrived reason. It's definitely
not how this should be done. Consider this more a request for a
proper patch.

Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Signed-off-by: Simona Vetter <simona.vetter@intel.com>
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 260165bbe373..aa51930a012b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1774,6 +1774,8 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 	ret = drm_gem_handle_create(adev->kfd.client.file, gobj, &(*mem)->gem_handle);
 	if (ret)
 		goto err_gem_handle_create;
+	/* FIXME: Thou shall completely initialize the bo before calling
+	 * drm_gem_handle_create. Or explain why this is safe. */
 	bo = gem_to_amdgpu_bo(gobj);
 	if (bo_type == ttm_bo_type_sg) {
 		bo->tbo.sg = sg;
-- 
2.49.0


* [PATCH 6/8] drm/amdgpu: Add comments about drm_file.object_idr issues
  2025-05-28  9:12 [PATCH 0/8] drm/gem: Audit around handle_create races Simona Vetter
                   ` (4 preceding siblings ...)
  2025-05-28  9:13 ` [PATCH 5/8] drm/amd/kfd: Add comment about possible drm_gem_handle_create() race Simona Vetter
@ 2025-05-28  9:13 ` Simona Vetter
  2025-05-28  9:22   ` Simona Vetter
  2025-05-28  9:13 ` [PATCH 7/8] drm/vmwgfx: " Simona Vetter
  2025-05-28  9:13 ` [PATCH 8/8] drm/xe: " Simona Vetter
  7 siblings, 1 reply; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:13 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Alex Deucher, Christian König,
	Arvind Yadav, Shashank Sharma, Yunxiang Li, Frank Min,
	Kent Russell, Simona Vetter

idr_for_each_entry() is fine, but will prematurely terminate on
transient NULL entries. It should be switched over to idr_for_each,
which allows you to handle this explicitly.

Note that transient NULL pointers in drm_file.object_idr have been a
thing since f6cd7daecff5 ("drm: Release driver references to handle
before making it available again"), this is a really old issue.

Since it's just a premature loop termination, the impact should be
fairly benign, at least for any debugfs or fdinfo code.

Aside: amdgpu_gem_force_release() looks questionable and should
probably be revisited in the light of the revised hotunplug design
we're aiming for. But that's an entirely separate can of worms.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Arvind Yadav <Arvind.Yadav@amd.com>
Cc: Shashank Sharma <shashank.sharma@amd.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Yunxiang Li <Yunxiang.Li@amd.com>
Cc: Frank Min <Frank.Min@amd.com>
Cc: Kent Russell <kent.russell@amd.com>
Signed-off-by: Simona Vetter <simona.vetter@intel.com>
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index 2c68118fe9fd..90723b13fa7d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -249,6 +249,7 @@ void amdgpu_gem_force_release(struct amdgpu_device *adev)
 
 		WARN_ONCE(1, "Still active user space clients!\n");
 		spin_lock(&file->table_lock);
+		/* FIXME: Use idr_for_each to handle transient NULL pointers */
 		idr_for_each_entry(&file->object_idr, gobj, handle) {
 			WARN_ONCE(1, "And also active allocations!\n");
 			drm_gem_object_put(gobj);
@@ -1167,6 +1168,7 @@ static int amdgpu_debugfs_gem_info_show(struct seq_file *m, void *unused)
 		rcu_read_unlock();
 
 		spin_lock(&file->table_lock);
+		/* FIXME: Use idr_for_each to handle transient NULL pointers */
 		idr_for_each_entry(&file->object_idr, gobj, id) {
 			struct amdgpu_bo *bo = gem_to_amdgpu_bo(gobj);
 
-- 
2.49.0


* [PATCH 7/8] drm/vmwgfx: Add comments about drm_file.object_idr issues
  2025-05-28  9:12 [PATCH 0/8] drm/gem: Audit around handle_create races Simona Vetter
                   ` (5 preceding siblings ...)
  2025-05-28  9:13 ` [PATCH 6/8] drm/amdgpu: Add comments about drm_file.object_idr issues Simona Vetter
@ 2025-05-28  9:13 ` Simona Vetter
  2025-05-28  9:23   ` Simona Vetter
  2025-05-28  9:13 ` [PATCH 8/8] drm/xe: " Simona Vetter
  7 siblings, 1 reply; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:13 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Simona Vetter, Zack Rusin,
	Broadcom internal kernel review list

idr_for_each_entry() is fine, but will prematurely terminate on
transient NULL entries. It should be switched over to idr_for_each,
which allows you to handle this explicitly.

Note that transient NULL pointers in drm_file.object_idr have been a
thing since f6cd7daecff5 ("drm: Release driver references to handle
before making it available again"), this is a really old issue.

Since it's just a premature loop termination, the impact should be
fairly benign, at least for any debugfs or fdinfo code.

Signed-off-by: Simona Vetter <simona.vetter@intel.com>
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Zack Rusin <zack.rusin@broadcom.com>
Cc: Broadcom internal kernel review list <bcm-kernel-feedback-list@broadcom.com>
---
 drivers/gpu/drm/vmwgfx/vmwgfx_gem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c b/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
index c55382167c1b..438e40b92281 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
@@ -323,6 +323,7 @@ static int vmw_debugfs_gem_info_show(struct seq_file *m, void *unused)
 		rcu_read_unlock();
 
 		spin_lock(&file->table_lock);
+		/* FIXME: Use idr_for_each to handle transient NULL pointers */
 		idr_for_each_entry(&file->object_idr, gobj, id) {
 			struct vmw_bo *bo = to_vmw_bo(gobj);
 
-- 
2.49.0


* [PATCH 8/8] drm/xe: Add comments about drm_file.object_idr issues
  2025-05-28  9:12 [PATCH 0/8] drm/gem: Audit around handle_create races Simona Vetter
                   ` (6 preceding siblings ...)
  2025-05-28  9:13 ` [PATCH 7/8] drm/vmwgfx: " Simona Vetter
@ 2025-05-28  9:13 ` Simona Vetter
  2025-05-28  9:24   ` Simona Vetter
  7 siblings, 1 reply; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:13 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Daniel Vetter, Lucas De Marchi,
	Thomas Hellström, Rodrigo Vivi

idr_for_each_entry() is fine, but will prematurely terminate on
transient NULL entries. It should be switched over to idr_for_each,
which allows you to handle this explicitly.

Note that transient NULL pointers in drm_file.object_idr have been a
thing since f6cd7daecff5 ("drm: Release driver references to handle
before making it available again"), this is a really old issue.

Since it's just a premature loop termination, the impact should be
fairly benign, at least for any debugfs or fdinfo code.

On top of that this code also drops drm_file.table_lock while
iterating, which can mess up the iterator state. And that's actually
bad.
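
One common way to avoid carrying an iterator across a dropped lock is to keep only the current id. The loop takes a reference on the entry, drops the lock, and after relocking looks up the next entry afresh. In the kernel, idr_get_next() takes the resume id as an in/out parameter for this kind of loop. The sketch below models the pattern with a pthread mutex and a plain array. All names are hypothetical, and this is not necessarily how xe ended up fixing show_meminfo().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NSLOTS 8

struct toy_bo {
	int refcount;
	size_t size;
};

struct toy_file {
	pthread_mutex_t table_lock;
	struct toy_bo *slots[NSLOTS];
};

/* Find the first published entry with id >= *id, take a reference and
 * leave its id in *id. Must be called with table_lock held. */
static struct toy_bo *next_bo_locked(struct toy_file *f, int *id)
{
	for (; *id < NSLOTS; (*id)++) {
		struct toy_bo *bo = f->slots[*id];

		if (bo) {
			bo->refcount++;
			return bo;
		}
	}
	return NULL;
}

/* Walk all BOs while the per-BO work runs without table_lock: only the
 * id survives across unlock/lock, never any iterator state. */
static size_t sum_sizes_unlocked_work(struct toy_file *f)
{
	size_t total = 0;
	int id = 0;

	for (;;) {
		struct toy_bo *bo;

		pthread_mutex_lock(&f->table_lock);
		bo = next_bo_locked(f, &id);
		pthread_mutex_unlock(&f->table_lock);
		if (!bo)
			break;

		/* Sleeping work (e.g. taking a BO lock) would go here; the
		 * reference keeps bo alive even if its handle is closed. */
		total += bo->size;

		assert(bo->refcount > 0);
		bo->refcount--;	/* toy put; a real one is atomic and may free */
		id++;		/* resume after the id just handled */
	}
	return total;
}
```

The trade-off is that the walk misses handles created behind the cursor while it runs. A handle closed right after being visited has still been counted. Both are usually acceptable for fdinfo-style statistics.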

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: intel-xe@lists.freedesktop.org
---
 drivers/gpu/drm/xe/xe_drm_client.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
index 31f688e953d7..2542f265a221 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -205,6 +205,7 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
 
 	/* Public objects. */
 	spin_lock(&file->table_lock);
+	/* FIXME: Use idr_for_each to handle transient NULL pointers */
 	idr_for_each_entry(&file->object_idr, obj, id) {
 		struct xe_bo *bo = gem_to_xe_bo(obj);
 
@@ -213,6 +214,8 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
 			xe_bo_unlock(bo);
 		} else {
 			xe_bo_get(bo);
+			/* FIXME: dropping the lock can mess the idr iterator
+			 * state up */
 			spin_unlock(&file->table_lock);
 
 			xe_bo_lock(bo, false);
-- 
2.49.0


* Re: [PATCH 2/8] drm/fdinfo: Switch to idr_for_each() in drm_show_memory_stats()
  2025-05-28  9:13 ` [PATCH 2/8] drm/fdinfo: Switch to idr_for_each() in drm_show_memory_stats() Simona Vetter
@ 2025-05-28  9:22   ` Simona Vetter
  2025-05-28 20:10   ` kernel test robot
  1 sibling, 0 replies; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:22 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Rob Clark, Emil Velikov, Tvrtko Ursulin,
	stable, Simona Vetter

On Wed, May 28, 2025 at 11:13:00AM +0200, Simona Vetter wrote:
> Unlike idr_for_each_entry(), which terminates on the first NULL entry,
> idr_for_each passes them through. This fixes potential issues with the
> idr walk terminating prematurely due to transient NULL entries that
> exist when creating and destroying a handle.
> 
> Note that transient NULL pointers in drm_file.object_idr have been a
> thing since f6cd7daecff5 ("drm: Release driver references to handle
> before making it available again"), this is a really old issue.
> 
> Aside from temporarily inconsistent fdinfo statistics there's no other
> impact of this issue.
> 
> Fixes: 686b21b5f6ca ("drm: Add fdinfo memory stats")
> Cc: Rob Clark <robdclark@chromium.org>
> Cc: Emil Velikov <emil.l.velikov@gmail.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: <stable@vger.kernel.org> # v6.5+
> Signed-off-by: Simona Vetter <simona.vetter@intel.com>
> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>

Ok, I screwed up reading idr_for_each_entry(), or rather
idr_get_next_ul(), big time: it already copes with NULL entries
entirely fine.

Mea culpa.
-Sima

> ---
>  drivers/gpu/drm/drm_file.c | 95 ++++++++++++++++++++++----------------
>  1 file changed, 55 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> index 246cf845e2c9..428a4eb85e94 100644
> --- a/drivers/gpu/drm/drm_file.c
> +++ b/drivers/gpu/drm/drm_file.c
> @@ -892,6 +892,58 @@ void drm_print_memory_stats(struct drm_printer *p,
>  }
>  EXPORT_SYMBOL(drm_print_memory_stats);
>  
> +struct drm_bo_print_data {
> +	struct drm_memory_stats status;
> +	enum drm_gem_object_status supported_status;
> +};
> +
> +static int
> +drm_bo_memory_stats(int id, void *ptr, void *data)
> +{
> +	struct drm_bo_print_data *drm_data;
> +	struct drm_gem_object *obj = ptr;
> +	enum drm_gem_object_status s = 0;
> +	size_t add_size;
> +
> +	if (!obj)
> +		return 0;
> +
> +	add_size = (obj->funcs && obj->funcs->rss) ?
> +		obj->funcs->rss(obj) : obj->size;
> +
> +	if (obj->funcs && obj->funcs->status) {
> +		s = obj->funcs->status(obj);
> +		drm_data->supported_status |= s;
> +	}
> +
> +	if (drm_gem_object_is_shared_for_memory_stats(obj))
> +		drm_data->status.shared += obj->size;
> +	else
> +		drm_data->status.private += obj->size;
> +
> +	if (s & DRM_GEM_OBJECT_RESIDENT) {
> +		drm_data->status.resident += add_size;
> +	} else {
> +		/* If already purged or not yet backed by pages, don't
> +		 * count it as purgeable:
> +		 */
> +		s &= ~DRM_GEM_OBJECT_PURGEABLE;
> +	}
> +
> +	if (!dma_resv_test_signaled(obj->resv, dma_resv_usage_rw(true))) {
> +		drm_data->status.active += add_size;
> +		drm_data->supported_status |= DRM_GEM_OBJECT_ACTIVE;
> +
> +		/* If still active, don't count as purgeable: */
> +		s &= ~DRM_GEM_OBJECT_PURGEABLE;
> +	}
> +
> +	if (s & DRM_GEM_OBJECT_PURGEABLE)
> +		drm_data->status.purgeable += add_size;
> +
> +	return 0;
> +}
> +
>  /**
>   * drm_show_memory_stats - Helper to collect and show standard fdinfo memory stats
>   * @p: the printer to print output to
> @@ -902,50 +954,13 @@ EXPORT_SYMBOL(drm_print_memory_stats);
>   */
>  void drm_show_memory_stats(struct drm_printer *p, struct drm_file *file)
>  {
> -	struct drm_gem_object *obj;
> -	struct drm_memory_stats status = {};
> -	enum drm_gem_object_status supported_status = 0;
> -	int id;
> +	struct drm_bo_print_data data = {};
>  
>  	spin_lock(&file->table_lock);
> -	idr_for_each_entry (&file->object_idr, obj, id) {
> -		enum drm_gem_object_status s = 0;
> -		size_t add_size = (obj->funcs && obj->funcs->rss) ?
> -			obj->funcs->rss(obj) : obj->size;
> -
> -		if (obj->funcs && obj->funcs->status) {
> -			s = obj->funcs->status(obj);
> -			supported_status |= s;
> -		}
> -
> -		if (drm_gem_object_is_shared_for_memory_stats(obj))
> -			status.shared += obj->size;
> -		else
> -			status.private += obj->size;
> -
> -		if (s & DRM_GEM_OBJECT_RESIDENT) {
> -			status.resident += add_size;
> -		} else {
> -			/* If already purged or not yet backed by pages, don't
> -			 * count it as purgeable:
> -			 */
> -			s &= ~DRM_GEM_OBJECT_PURGEABLE;
> -		}
> -
> -		if (!dma_resv_test_signaled(obj->resv, dma_resv_usage_rw(true))) {
> -			status.active += add_size;
> -			supported_status |= DRM_GEM_OBJECT_ACTIVE;
> -
> -			/* If still active, don't count as purgeable: */
> -			s &= ~DRM_GEM_OBJECT_PURGEABLE;
> -		}
> -
> -		if (s & DRM_GEM_OBJECT_PURGEABLE)
> -			status.purgeable += add_size;
> -	}
> +	idr_for_each(&file->object_idr, &drm_bo_memory_stats, &data);
>  	spin_unlock(&file->table_lock);
>  
> -	drm_print_memory_stats(p, &status, supported_status, "memory");
> +	drm_print_memory_stats(p, &data.status, data.supported_status, "memory");
>  }
>  EXPORT_SYMBOL(drm_show_memory_stats);
>  
> -- 
> 2.49.0
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

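Sima's correction turns on how the idr_for_each_entry() walk treats a NULL slot. What follows is a userspace toy model only, assuming a plain array in place of the kernel IDR; toy_idr, toy_get_next(), toy_for_each_entry() and toy_count_live() are hypothetical names, not kernel API. toy_get_next() mimics the behaviour described for idr_get_next() in the reply: it searches forward from the cursor for the next populated slot, so a reserved-but-unpublished (NULL) id is skipped instead of ending the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an IDR: slot i holds the entry for id i. A NULL slot
 * models an id that is free, or reserved but not yet published. */
#define TOY_IDR_SIZE 8

struct toy_idr {
	void *slots[TOY_IDR_SIZE];
};

/* Modelled on the idr_get_next() behaviour discussed above: return the
 * first non-NULL entry with id >= *nextid and store its id in *nextid.
 * NULL slots are skipped; NULL is returned only when none remain. */
static void *toy_get_next(const struct toy_idr *idr, int *nextid)
{
	for (int id = *nextid; id < TOY_IDR_SIZE; id++) {
		if (idr->slots[id]) {
			*nextid = id;
			return idr->slots[id];
		}
	}
	return NULL;
}

/* Same loop shape as idr_for_each_entry(idr, entry, id). */
#define toy_for_each_entry(idr, entry, id) \
	for ((id) = 0; ((entry) = toy_get_next((idr), &(id))) != NULL; (id)++)

/* Count the populated entries the walk actually visits. */
static int toy_count_live(const struct toy_idr *idr)
{
	void *entry;
	int id, n = 0;

	toy_for_each_entry(idr, entry, id)
		n++;
	return n;
}
```

With slots {NULL, &a, NULL, &b, NULL, NULL, &c, NULL} the walk visits ids 1, 3 and 6; in this model the loop ends only when no populated slot remains past the cursor.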

* Re: [PATCH 6/8] drm/amdgpu: Add comments about drm_file.object_idr issues
  2025-05-28  9:13 ` [PATCH 6/8] drm/amdgpu: Add comments about drm_file.object_idr issues Simona Vetter
@ 2025-05-28  9:22   ` Simona Vetter
  0 siblings, 0 replies; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:22 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Alex Deucher, Christian König,
	Arvind Yadav, Shashank Sharma, Yunxiang Li, Frank Min,
	Kent Russell, Simona Vetter

On Wed, May 28, 2025 at 11:13:04AM +0200, Simona Vetter wrote:
> idr_for_each_entry() is fine, but will prematurely terminate on
> transient NULL entries. It should be switched over to idr_for_each,
> which allows you to handle this explicitly.
> 
> Note that transient NULL pointers in drm_file.object_idr have been a
> thing since f6cd7daecff5 ("drm: Release driver references to handle
> before making it available again"), this is a really old issue.
> 
> Since it's just a premature loop termination, the impact should be fairly
> benign, at least for any debugfs or fdinfo code.

Misread idr_get_next and I now think it should be fine as-is. Please
disregard this one.
-Sima

> 
> Aside: amdgpu_gem_force_release() looks questionable and should
> probably be revisited in the light of the revised hotunplug design
> we're aiming for. But that's an entirely separate can of worms.
> 
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: Arvind Yadav <Arvind.Yadav@amd.com>
> Cc: Shashank Sharma <shashank.sharma@amd.com>
> Cc: Simona Vetter <simona.vetter@ffwll.ch>
> Cc: Yunxiang Li <Yunxiang.Li@amd.com>
> Cc: Frank Min <Frank.Min@amd.com>
> Cc: Kent Russell <kent.russell@amd.com>
> Signed-off-by: Simona Vetter <simona.vetter@intel.com>
> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> index 2c68118fe9fd..90723b13fa7d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> @@ -249,6 +249,7 @@ void amdgpu_gem_force_release(struct amdgpu_device *adev)
>  
>  		WARN_ONCE(1, "Still active user space clients!\n");
>  		spin_lock(&file->table_lock);
> +		/* FIXME: Use idr_for_each to handle transient NULL pointers */
>  		idr_for_each_entry(&file->object_idr, gobj, handle) {
>  			WARN_ONCE(1, "And also active allocations!\n");
>  			drm_gem_object_put(gobj);
> @@ -1167,6 +1168,7 @@ static int amdgpu_debugfs_gem_info_show(struct seq_file *m, void *unused)
>  		rcu_read_unlock();
>  
>  		spin_lock(&file->table_lock);
> +		/* FIXME: Use idr_for_each to handle transient NULL pointers */
>  		idr_for_each_entry(&file->object_idr, gobj, id) {
>  			struct amdgpu_bo *bo = gem_to_amdgpu_bo(gobj);
>  
> -- 
> 2.49.0
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 7/8] drm/vmwgfx: Add comments about drm_file.object_idr issues
  2025-05-28  9:13 ` [PATCH 7/8] drm/vmwgfx: " Simona Vetter
@ 2025-05-28  9:23   ` Simona Vetter
  0 siblings, 0 replies; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:23 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Simona Vetter, Zack Rusin,
	Broadcom internal kernel review list

On Wed, May 28, 2025 at 11:13:05AM +0200, Simona Vetter wrote:
> idr_for_each_entry() is fine, but will prematurely terminate on
> transient NULL entries. It should be switched over to idr_for_each,
> which allows you to handle this explicitly.
> 
> Note that transient NULL pointers in drm_file.object_idr have been a
> thing since f6cd7daecff5 ("drm: Release driver references to handle
> before making it available again"), this is a really old issue.
> 
> Since it's just a premature loop termination, the impact should be fairly
> benign, at least for any debugfs or fdinfo code.

Rereading idr_get_next I now think it's all fine, please disregard this
patch.
-Sima

> 
> Signed-off-by: Simona Vetter <simona.vetter@intel.com>
> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> Cc: Zack Rusin <zack.rusin@broadcom.com>
> Cc: Broadcom internal kernel review list <bcm-kernel-feedback-list@broadcom.com>
> ---
>  drivers/gpu/drm/vmwgfx/vmwgfx_gem.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c b/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
> index c55382167c1b..438e40b92281 100644
> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_gem.c
> @@ -323,6 +323,7 @@ static int vmw_debugfs_gem_info_show(struct seq_file *m, void *unused)
>  		rcu_read_unlock();
>  
>  		spin_lock(&file->table_lock);
> +		/* FIXME: Use idr_for_each to handle transient NULL pointers */
>  		idr_for_each_entry(&file->object_idr, gobj, id) {
>  			struct vmw_bo *bo = to_vmw_bo(gobj);
>  
> -- 
> 2.49.0
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 8/8] drm/xe: Add comments about drm_file.object_idr issues
  2025-05-28  9:13 ` [PATCH 8/8] drm/xe: " Simona Vetter
@ 2025-05-28  9:24   ` Simona Vetter
  0 siblings, 0 replies; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:24 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Daniel Vetter, Lucas De Marchi,
	Thomas Hellström, Rodrigo Vivi

On Wed, May 28, 2025 at 11:13:06AM +0200, Simona Vetter wrote:
> idr_for_each_entry() is fine, but will prematurely terminate on
> transient NULL entries. It should be switched over to idr_for_each,
> which allows you to handle this explicitly.
> 
> Note that transient NULL pointers in drm_file.object_idr have been a
> thing since f6cd7daecff5 ("drm: Release driver references to handle
> before making it available again"), this is a really old issue.
> 
> Since it's just a premature loop termination, the impact should be fairly
> benign, at least for any debugfs or fdinfo code.
> 
> On top of that, this code also drops drm_file.table_lock while
> iterating, which can mess up the iterator state. And that's actually
> bad.

So I re-read idr_get_next() and all that, and I think it should all be
fine: it handles NULL entries, and I think it also recovers by simply
resuming from the most recent id. It might miss some entries that were
added concurrently, but that should be fine.

Sorry for the noise.
-Sima

> 
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: intel-xe@lists.freedesktop.org
> ---
>  drivers/gpu/drm/xe/xe_drm_client.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
> index 31f688e953d7..2542f265a221 100644
> --- a/drivers/gpu/drm/xe/xe_drm_client.c
> +++ b/drivers/gpu/drm/xe/xe_drm_client.c
> @@ -205,6 +205,7 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
>  
>  	/* Public objects. */
>  	spin_lock(&file->table_lock);
> +	/* FIXME: Use idr_for_each to handle transient NULL pointers */
>  	idr_for_each_entry(&file->object_idr, obj, id) {
>  		struct xe_bo *bo = gem_to_xe_bo(obj);
>  
> @@ -213,6 +214,8 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
>  			xe_bo_unlock(bo);
>  		} else {
>  			xe_bo_get(bo);
> +			/* FIXME: dropping the lock can mess the idr iterator
> +			 * state up */
>  			spin_unlock(&file->table_lock);
>  
>  			xe_bo_lock(bo, false);
> -- 
> 2.49.0
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

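The point about recovering from the most recent id can be illustrated the same way. This is again a hypothetical userspace model, not xe or IDR code: the walk's only state is the integer cursor, so after the lock is dropped and retaken it simply resumes at id + 1 against the table as it now stands. toy_walk_dropping_lock() and toy_churn_at_3() are invented for illustration; the latter stands in for whatever other threads do while table_lock is not held.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_IDR_SIZE 8

struct toy_idr {
	void *slots[TOY_IDR_SIZE];
};

/* Forward search: next non-NULL slot at or after *nextid, or NULL. */
static void *toy_get_next(const struct toy_idr *idr, int *nextid)
{
	for (int id = *nextid; id < TOY_IDR_SIZE; id++) {
		if (idr->slots[id]) {
			*nextid = id;
			return idr->slots[id];
		}
	}
	return NULL;
}

/* Hypothetical hook for what other threads do while the walker has
 * dropped table_lock; cur_id is where the walker currently sits. */
typedef void (*toy_unlocked_fn)(struct toy_idr *idr, int cur_id);

/* Walk every entry, "dropping the lock" after each one. The only
 * iterator state is the integer id, so the next lookup starts at id + 1
 * in whatever the table looks like by then. Visited ids are stored in
 * visited[] (at most max of them); returns the number of visits. */
static int toy_walk_dropping_lock(struct toy_idr *idr, toy_unlocked_fn fn,
				  int *visited, int max)
{
	void *entry;
	int id, n = 0;

	for (id = 0; (entry = toy_get_next(idr, &id)) != NULL; id++) {
		if (n < max)
			visited[n] = id;
		n++;
		/* The real code would spin_unlock(), do sleeping work and
		 * spin_lock() again here; other threads may run meanwhile. */
		if (fn)
			fn(idr, id);
	}
	return n;
}

/* Example interference while the walker sits on id 3: free id 5 and
 * publish new entries at id 1 (behind the cursor) and id 6 (ahead). */
static int toy_new_objs[2];

static void toy_churn_at_3(struct toy_idr *idr, int cur_id)
{
	if (cur_id != 3)
		return;
	idr->slots[5] = NULL;
	idr->slots[1] = &toy_new_objs[0];
	idr->slots[6] = &toy_new_objs[1];
}
```

Starting from ids 2, 3 and 5, the walk visits 2, 3 and 6: the entry freed at 5 is skipped, the one published at 6 is picked up, and the one published at 1, behind the cursor, is missed. That matches the "might miss some that have been concurrently added" caveat. The model says nothing about the lifetime of the entry held across the unlock, which the xe code covers by taking a reference with xe_bo_get() before dropping the lock.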

* Re: [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail()
  2025-05-28  9:12 ` [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail() Simona Vetter
@ 2025-05-28  9:26   ` Simona Vetter
  2025-05-28 13:20   ` Jacek Lawrynowicz
  2025-06-02 15:15   ` Thomas Zimmermann
  2 siblings, 0 replies; 27+ messages in thread
From: Simona Vetter @ 2025-05-28  9:26 UTC (permalink / raw)
  To: DRI Development
  Cc: intel-xe, Simona Vetter, Jacek Lawrynowicz, stable,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Simona Vetter

On Wed, May 28, 2025 at 11:12:59AM +0200, Simona Vetter wrote:
> Object creation is a careful dance where we must guarantee that the
> object is fully constructed before it is visible to other threads, and
> GEM buffer objects are no different.
> 
> Final publishing happens by calling drm_gem_handle_create(). After
> that the only allowed thing to do is call drm_gem_object_put() because
> a concurrent call to the GEM_CLOSE ioctl with a correctly guessed id
> (which is trivial since we have a linear allocator) can already tear
> down the object again.
> 
> Luckily most drivers get this right; for the very few exceptions I've
> pinged the relevant maintainers. Unfortunately we also need
> drm_gem_handle_create() when creating additional handles for an
> already existing object (e.g. GETFB ioctl or the various bo import
> ioctls), and hence we cannot have a drm_gem_handle_create_and_put() as
> the only exported function to stop these issues from happening.
> 
> Now unfortunately the implementation of drm_gem_handle_create() isn't
> living up to standards: It does correctly finish object
> initialization at the global level, and hence is safe against a
> concurrent tear down. But it also sets up the file-private aspects of
> the handle, and that part goes wrong: We fully register the object in
> the drm_file.object_idr before calling drm_vma_node_allow() or
> obj->funcs->open, which opens up races against concurrent removal of
> that handle in drm_gem_handle_delete().
> 
> Fix this with the usual two-stage approach of first reserving the
> handle id, and then only registering the object after we've completed
> the file-private setup.
> 
> Jacek reported this with a testcase of concurrently calling GEM_CLOSE
> on a freshly-created object (which also destroys the object), but it
> should be possible to hit this with just additional handles created
> through import or GETFB without completely destroying the underlying
> object with the concurrent GEM_CLOSE ioctl calls.
> 
> Note that the close-side of this race was fixed in f6cd7daecff5 ("drm:
> Release driver references to handle before making it available
> again"), which means a cool 9 years have passed until someone noticed
> that we need to make this symmetric or there are still gaps left :-/
> Without the 2-stage close approach we'd still have a race, therefore
> that's an integral part of this bugfix.
> 
> More importantly, this means we can have NULL pointers behind
> allocated ids in our drm_file.object_idr. We need to check for that
> now:
> 
> - drm_gem_handle_delete() checks for ERR_OR_NULL already
> 
> - drm_gem.c:object_lookup() also checks for NULL
> 
> - drm_gem_release() should never be called if there's another thread
>   still existing that could call into an IOCTL that creates a new
>   handle, so cannot race. For paranoia I added a NULL check to
>   drm_gem_object_release_handle() though.
> 
> - most drivers (etnaviv, i915, msm) are fine because they use
>   idr_find, which maps both ENOENT and NULL to NULL.
> 
> - vmwgfx is already broken in vmw_debugfs_gem_info_show() because NULL
>   pointers might exist due to drm_gem_handle_delete(). This needs a
>   separate patch. This is because idr_for_each_entry terminates on the
>   first NULL entry and so might not iterate over everything.
> 
> - similar for amd in amdgpu_debugfs_gem_info_show() and
>   amdgpu_gem_force_release(). The latter is really questionable though
>   since it's a best effort hack and there's no way to close all the
>   races. Needs separate patches.
> 
> - xe is really broken because it not only uses idr_for_each_entry() but
>   also drops the drm_file.table_lock, which can wreck the idr iterator
>   state if you're unlucky enough. Maybe another reason to look into
>   the drm fdinfo memory stats instead of hand-rolling too much.
> 
> - drm_show_memory_stats() is also broken since it uses
>   idr_for_each_entry. But since that's a preexisting bug I'll follow
>   up with a separate patch.

I've already reworded the commit message locally since I now think
idr_for_each_entry is entirely fine.
-Sima

> 
> Reported-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> Cc: stable@vger.kernel.org
> Cc: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Signed-off-by: Simona Vetter <simona.vetter@intel.com>
> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> ---
>  drivers/gpu/drm/drm_gem.c | 10 +++++++++-
>  include/drm/drm_file.h    |  3 +++
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index 1e659d2660f7..e4e20dda47b1 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -279,6 +279,9 @@ drm_gem_object_release_handle(int id, void *ptr, void *data)
>  	struct drm_file *file_priv = data;
>  	struct drm_gem_object *obj = ptr;
>  
> +	if (WARN_ON(!data))
> +		return 0;
> +
>  	if (obj->funcs->close)
>  		obj->funcs->close(obj, file_priv);
>  
> @@ -399,7 +402,7 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
>  	idr_preload(GFP_KERNEL);
>  	spin_lock(&file_priv->table_lock);
>  
> -	ret = idr_alloc(&file_priv->object_idr, obj, 1, 0, GFP_NOWAIT);
> +	ret = idr_alloc(&file_priv->object_idr, NULL, 1, 0, GFP_NOWAIT);
>  
>  	spin_unlock(&file_priv->table_lock);
>  	idr_preload_end();
> @@ -420,6 +423,11 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
>  			goto err_revoke;
>  	}
>  
> +	/* mirrors drm_gem_handle_delete to avoid races */
> +	spin_lock(&file_priv->table_lock);
> +	obj = idr_replace(&file_priv->object_idr, obj, handle);
> +	WARN_ON(obj != NULL);
> +	spin_unlock(&file_priv->table_lock);
>  	*handlep = handle;
>  	return 0;
>  
> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> index 5c3b2aa3e69d..d344d41e6cfe 100644
> --- a/include/drm/drm_file.h
> +++ b/include/drm/drm_file.h
> @@ -300,6 +300,9 @@ struct drm_file {
>  	 *
>  	 * Mapping of mm object handles to object pointers. Used by the GEM
>  	 * subsystem. Protected by @table_lock.
> +	 *
> +	 * Note that allocated entries might be NULL as a transient state when
> +	 * creating or deleting a handle.
>  	 */
>  	struct idr object_idr;
>  
> -- 
> 2.49.0
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

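The reserve-then-publish scheme in the patch can be sketched outside the kernel as follows. This is a toy model under stated assumptions: a fixed array and a pthread mutex stand in for drm_file.object_idr and table_lock, and toy_reserve(), toy_publish(), toy_lookup() and toy_release_handle() are hypothetical. Stage one reserves an id with a NULL entry, as the idr_alloc(..., NULL, ...) call does; stage two installs the object only after the per-file setup, as idr_replace() does. Until then a lookup cannot tell the reserved id from a free one, which is what closes the window against a concurrent GEM_CLOSE. One detail that may be worth double-checking in the quoted hunk: the new guard in drm_gem_object_release_handle() tests WARN_ON(!data), i.e. the drm_file pointer, while the commit message describes a NULL check for the object. The toy release callback checks the object pointer, which is our reading of the intent rather than something the thread confirms.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define TOY_MAX_HANDLES 16

/* Toy per-file handle table. Id 0 is never handed out, like the
 * idr_alloc(..., 1, 0, ...) call in the patch. used[] marks reserved
 * ids; objs[] holds the published pointer (NULL until published). */
struct toy_file {
	pthread_mutex_t table_lock;
	int used[TOY_MAX_HANDLES];
	void *objs[TOY_MAX_HANDLES];
};

/* Stage 1: reserve the lowest free id with a NULL entry. */
static int toy_reserve(struct toy_file *f)
{
	int id = -1;

	pthread_mutex_lock(&f->table_lock);
	for (int i = 1; i < TOY_MAX_HANDLES; i++) {
		if (!f->used[i]) {
			f->used[i] = 1;
			id = i;
			break;
		}
	}
	pthread_mutex_unlock(&f->table_lock);
	return id;
}

/* Stage 2: once all per-file setup is done, publish the object for an
 * id from toy_reserve(). Returns the previous entry, expected NULL. */
static void *toy_publish(struct toy_file *f, int id, void *obj)
{
	void *old;

	pthread_mutex_lock(&f->table_lock);
	old = f->objs[id];
	f->objs[id] = obj;
	pthread_mutex_unlock(&f->table_lock);
	return old;
}

/* Like idr_find(): "reserved" and "unused" both yield NULL, so a racing
 * close-style caller finds nothing to tear down. */
static void *toy_lookup(struct toy_file *f, int id)
{
	void *obj;

	pthread_mutex_lock(&f->table_lock);
	obj = (id > 0 && id < TOY_MAX_HANDLES) ? f->objs[id] : NULL;
	pthread_mutex_unlock(&f->table_lock);
	return obj;
}

/* Teardown callback with the idr_for_each() signature: skip a slot that
 * is still reserved (NULL object) instead of dereferencing it. */
static int toy_release_handle(int id, void *obj, void *data)
{
	int *released = data;

	(void)id;
	if (!obj)
		return 0;
	(*released)++;
	return 0;
}
```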

* Re: [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail()
  2025-05-28  9:12 ` [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail() Simona Vetter
  2025-05-28  9:26   ` Simona Vetter
@ 2025-05-28 13:20   ` Jacek Lawrynowicz
  2025-06-02 15:15   ` Thomas Zimmermann
  2 siblings, 0 replies; 27+ messages in thread
From: Jacek Lawrynowicz @ 2025-05-28 13:20 UTC (permalink / raw)
  To: Simona Vetter, DRI Development
  Cc: intel-xe, stable, Maarten Lankhorst, Maxime Ripard,
	Thomas Zimmermann, David Airlie, Simona Vetter, Simona Vetter

This fixes the race for me.

Tested-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>

On 5/28/2025 11:12 AM, Simona Vetter wrote:
> Object creation is a careful dance where we must guarantee that the
> object is fully constructed before it is visible to other threads, and
> GEM buffer objects are no different.
> 
> Final publishing happens by calling drm_gem_handle_create(). After
> that the only allowed thing to do is call drm_gem_object_put() because
> a concurrent call to the GEM_CLOSE ioctl with a correctly guessed id
> (which is trivial since we have a linear allocator) can already tear
> down the object again.
> 
> Luckily most drivers get this right; for the very few exceptions I've
> pinged the relevant maintainers. Unfortunately we also need
> drm_gem_handle_create() when creating additional handles for an
> already existing object (e.g. GETFB ioctl or the various bo import
> ioctls), and hence we cannot have a drm_gem_handle_create_and_put() as
> the only exported function to stop these issues from happening.
> 
> Now unfortunately the implementation of drm_gem_handle_create() isn't
> living up to standards: It does correctly finish object
> initialization at the global level, and hence is safe against a
> concurrent tear down. But it also sets up the file-private aspects of
> the handle, and that part goes wrong: We fully register the object in
> the drm_file.object_idr before calling drm_vma_node_allow() or
> obj->funcs->open, which opens up races against concurrent removal of
> that handle in drm_gem_handle_delete().
> 
> Fix this with the usual two-stage approach of first reserving the
> handle id, and then only registering the object after we've completed
> the file-private setup.
> 
> Jacek reported this with a testcase of concurrently calling GEM_CLOSE
> on a freshly-created object (which also destroys the object), but it
> should be possible to hit this with just additional handles created
> through import or GETFB without completely destroying the underlying
> object with the concurrent GEM_CLOSE ioctl calls.
> 
> Note that the close-side of this race was fixed in f6cd7daecff5 ("drm:
> Release driver references to handle before making it available
> again"), which means a cool 9 years have passed until someone noticed
> that we need to make this symmetric or there are still gaps left :-/
> Without the 2-stage close approach we'd still have a race, therefore
> that's an integral part of this bugfix.
> 
> More importantly, this means we can have NULL pointers behind
> allocated ids in our drm_file.object_idr. We need to check for that
> now:
> 
> - drm_gem_handle_delete() checks for ERR_OR_NULL already
> 
> - drm_gem.c:object_lookup() also checks for NULL
> 
> - drm_gem_release() should never be called if there's another thread
>   still existing that could call into an IOCTL that creates a new
>   handle, so cannot race. For paranoia I added a NULL check to
>   drm_gem_object_release_handle() though.
> 
> - most drivers (etnaviv, i915, msm) are fine because they use
>   idr_find, which maps both ENOENT and NULL to NULL.
> 
> - vmwgfx is already broken in vmw_debugfs_gem_info_show() because NULL
>   pointers might exist due to drm_gem_handle_delete(). This needs a
>   separate patch. This is because idr_for_each_entry terminates on the
>   first NULL entry and so might not iterate over everything.
> 
> - similar for amd in amdgpu_debugfs_gem_info_show() and
>   amdgpu_gem_force_release(). The latter is really questionable though
>   since it's a best effort hack and there's no way to close all the
>   races. Needs separate patches.
> 
> - xe is really broken because it not only uses idr_for_each_entry() but
>   also drops the drm_file.table_lock, which can wreck the idr iterator
>   state if you're unlucky enough. Maybe another reason to look into
>   the drm fdinfo memory stats instead of hand-rolling too much.
> 
> - drm_show_memory_stats() is also broken since it uses
>   idr_for_each_entry. But since that's a preexisting bug I'll follow
>   up with a separate patch.
> 
> Reported-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> Cc: stable@vger.kernel.org
> Cc: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Signed-off-by: Simona Vetter <simona.vetter@intel.com>
> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> ---
>  drivers/gpu/drm/drm_gem.c | 10 +++++++++-
>  include/drm/drm_file.h    |  3 +++
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index 1e659d2660f7..e4e20dda47b1 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -279,6 +279,9 @@ drm_gem_object_release_handle(int id, void *ptr, void *data)
>  	struct drm_file *file_priv = data;
>  	struct drm_gem_object *obj = ptr;
>  
> +	if (WARN_ON(!data))
> +		return 0;
> +
>  	if (obj->funcs->close)
>  		obj->funcs->close(obj, file_priv);
>  
> @@ -399,7 +402,7 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
>  	idr_preload(GFP_KERNEL);
>  	spin_lock(&file_priv->table_lock);
>  
> -	ret = idr_alloc(&file_priv->object_idr, obj, 1, 0, GFP_NOWAIT);
> +	ret = idr_alloc(&file_priv->object_idr, NULL, 1, 0, GFP_NOWAIT);
>  
>  	spin_unlock(&file_priv->table_lock);
>  	idr_preload_end();
> @@ -420,6 +423,11 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
>  			goto err_revoke;
>  	}
>  
> +	/* mirrors drm_gem_handle_delete to avoid races */
> +	spin_lock(&file_priv->table_lock);
> +	obj = idr_replace(&file_priv->object_idr, obj, handle);
> +	WARN_ON(obj != NULL);
> +	spin_unlock(&file_priv->table_lock);
>  	*handlep = handle;
>  	return 0;
>  
> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> index 5c3b2aa3e69d..d344d41e6cfe 100644
> --- a/include/drm/drm_file.h
> +++ b/include/drm/drm_file.h
> @@ -300,6 +300,9 @@ struct drm_file {
>  	 *
>  	 * Mapping of mm object handles to object pointers. Used by the GEM
>  	 * subsystem. Protected by @table_lock.
> +	 *
> +	 * Note that allocated entries might be NULL as a transient state when
> +	 * creating or deleting a handle.
>  	 */
>  	struct idr object_idr;
>  



* Re: [PATCH 4/8] accel/qaic: delete qaic_bo.handle
  2025-05-28  9:13 ` [PATCH 4/8] accel/qaic: delete qaic_bo.handle Simona Vetter
@ 2025-05-28 15:15   ` Jeff Hugo
  2025-06-02 14:43     ` Simona Vetter
  2025-06-06 16:25   ` Jeff Hugo
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Hugo @ 2025-05-28 15:15 UTC (permalink / raw)
  To: Simona Vetter, DRI Development
  Cc: intel-xe, Carl Vanderlip, linux-arm-msm, Simona Vetter

On 5/28/2025 3:13 AM, Simona Vetter wrote:
> Handles are per-file, not global, so this makes no sense. Plus it's
> set only after calling drm_gem_handle_create(), and drivers are not
> allowed to further intialize a bo after that function has published it
> already.

intialize -> initialize

> It is also entirely unused, which helps enormously with removing it
> :-)

There is a downstream reference to it which hasn't quite made it 
upstream yet, but tweaking that should be fine. This is clearly a 
problem anyways, so we'll need to find a solution regardless. Thank you 
very much for the audit.

> Since we're still holding a reference to the bo nothing bad can
> happen, hence not cc: stable material.
> 
> Cc: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
> Cc: Carl Vanderlip <quic_carlv@quicinc.com>
> Cc: linux-arm-msm@vger.kernel.org
> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> Signed-off-by: Simona Vetter <simona.vetter@intel.com>

SOB chain seems weird to me. I got this email from @ffwll.ch, which 
would be the author. Where is @intel.com contributing to the handoff of 
the patch?

Overall, looks good to me. Seems like either I can ack this, and you can 
merge, or I can just take it forward. I have no preference.  Do you?

-Jeff


* Re: [PATCH 2/8] drm/fdinfo: Switch to idr_for_each() in drm_show_memory_stats()
  2025-05-28  9:13 ` [PATCH 2/8] drm/fdinfo: Switch to idr_for_each() in drm_show_memory_stats() Simona Vetter
  2025-05-28  9:22   ` Simona Vetter
@ 2025-05-28 20:10   ` kernel test robot
  1 sibling, 0 replies; 27+ messages in thread
From: kernel test robot @ 2025-05-28 20:10 UTC (permalink / raw)
  To: Simona Vetter, DRI Development
  Cc: llvm, oe-kbuild-all, intel-xe, Simona Vetter, Rob Clark,
	Emil Velikov, Tvrtko Ursulin, stable

Hi Simona,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20250527]
[also build test WARNING on linus/master v6.15]
[cannot apply to v6.15 v6.15-rc7 v6.15-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Simona-Vetter/drm-gem-Fix-race-in-drm_gem_handle_create_tail/20250528-171524
base:   next-20250527
patch link:    https://lore.kernel.org/r/20250528091307.1894940-3-simona.vetter%40ffwll.ch
patch subject: [PATCH 2/8] drm/fdinfo: Switch to idr_for_each() in drm_show_memory_stats()
config: riscv-randconfig-001-20250529 (https://download.01.org/0day-ci/archive/20250529/202505290334.GjoY9qsk-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project f819f46284f2a79790038e1f6649172789734ae8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250529/202505290334.GjoY9qsk-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505290334.GjoY9qsk-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/gpu/drm/drm_file.c:916:3: warning: variable 'drm_data' is uninitialized when used here [-Wuninitialized]
     916 |                 drm_data->supported_status |= s;
         |                 ^~~~~~~~
   drivers/gpu/drm/drm_file.c:903:36: note: initialize the variable 'drm_data' to silence this warning
     903 |         struct drm_bo_print_data *drm_data;
         |                                           ^
         |                                            = NULL
   1 warning generated.


vim +/drm_data +916 drivers/gpu/drm/drm_file.c

   899	
   900	static int
   901	drm_bo_memory_stats(int id, void *ptr, void *data)
   902	{
   903		struct drm_bo_print_data *drm_data;
   904		struct drm_gem_object *obj = ptr;
   905		enum drm_gem_object_status s = 0;
   906		size_t add_size;
   907	
   908		if (!obj)
   909			return 0;
   910	
   911		add_size = (obj->funcs && obj->funcs->rss) ?
   912			obj->funcs->rss(obj) : obj->size;
   913	
   914		if (obj->funcs && obj->funcs->status) {
   915			s = obj->funcs->status(obj);
 > 916			drm_data->supported_status |= s;
   917		}
   918	
   919		if (drm_gem_object_is_shared_for_memory_stats(obj))
   920			drm_data->status.shared += obj->size;
   921		else
   922			drm_data->status.private += obj->size;
   923	
   924		if (s & DRM_GEM_OBJECT_RESIDENT) {
   925			drm_data->status.resident += add_size;
   926		} else {
   927			/* If already purged or not yet backed by pages, don't
   928			 * count it as purgeable:
   929			 */
   930			s &= ~DRM_GEM_OBJECT_PURGEABLE;
   931		}
   932	
   933		if (!dma_resv_test_signaled(obj->resv, dma_resv_usage_rw(true))) {
   934			drm_data->status.active += add_size;
   935			drm_data->supported_status |= DRM_GEM_OBJECT_ACTIVE;
   936	
   937			/* If still active, don't count as purgeable: */
   938			s &= ~DRM_GEM_OBJECT_PURGEABLE;
   939		}
   940	
   941		if (s & DRM_GEM_OBJECT_PURGEABLE)
   942			drm_data->status.purgeable += add_size;
   943	
   944		return 0;
   945	}
   946	
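Editorial note: the warning points at a real bug. drm_bo_memory_stats() is an idr_for_each() callback, so the accumulator it updates arrives through the opaque void *data argument, but the local drm_data is never assigned from it. The likely fix is to declare it as `struct drm_bo_print_data *drm_data = data;`. The userspace sketch below shows that callback pattern. toy_idr, toy_idr_for_each() and the other toy_* names are hypothetical stand-ins, not kernel API. The toy iterator also visits empty slots, so the callback's NULL guard has something to do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct drm_bo_print_data: accumulated stats. */
struct toy_print_data {
	size_t private_size;
	unsigned int visited;
};

/* Hypothetical stand-in for struct drm_gem_object. */
struct toy_obj {
	size_t size;
};

/* Hypothetical fixed-size table modelling drm_file.object_idr. A NULL slot
 * models a handle id with no object bound to it. */
#define TOY_IDR_SLOTS 4
struct toy_idr {
	void *slot[TOY_IDR_SLOTS];
};

/* Models the shape of idr_for_each(): calls fn(id, ptr, data) per slot and
 * stops early if fn returns non-zero. This toy visits every slot,
 * including empty ones. */
static int toy_idr_for_each(struct toy_idr *idr,
			    int (*fn)(int id, void *ptr, void *data),
			    void *data)
{
	for (int id = 0; id < TOY_IDR_SLOTS; id++) {
		int ret = fn(id, idr->slot[id], data);

		if (ret)
			return ret;
	}
	return 0;
}

static int toy_bo_memory_stats(int id, void *ptr, void *data)
{
	/* The fix: derive the accumulator from the opaque cookie instead of
	 * leaving the local pointer uninitialized. */
	struct toy_print_data *drm_data = data;
	struct toy_obj *obj = ptr;

	(void)id;
	if (!obj)	/* empty slot: nothing to count */
		return 0;

	drm_data->private_size += obj->size;
	drm_data->visited++;
	return 0;
}
```

Returning 0 from the callback keeps the walk going, which is what a stats collector wants.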

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/8] drm/panthor: Fix UAF in panthor_gem_create_with_handle() debugfs code
  2025-05-28  9:13 ` [PATCH 3/8] drm/panthor: Fix UAF in panthor_gem_create_with_handle() debugfs code Simona Vetter
@ 2025-05-29 12:31   ` kernel test robot
  2025-06-01 14:06   ` Adrián Larumbe
  1 sibling, 0 replies; 27+ messages in thread
From: kernel test robot @ 2025-05-29 12:31 UTC (permalink / raw)
  To: Simona Vetter, DRI Development
  Cc: oe-kbuild-all, intel-xe, Simona Vetter, Adrián Larumbe,
	Boris Brezillon, Steven Price, Liviu Dudau

Hi Simona,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20250527]
[also build test ERROR on linus/master]
[cannot apply to v6.15 v6.15-rc7 v6.15-rc6 v6.15]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Simona-Vetter/drm-gem-Fix-race-in-drm_gem_handle_create_tail/20250528-171524
base:   next-20250527
patch link:    https://lore.kernel.org/r/20250528091307.1894940-4-simona.vetter%40ffwll.ch
patch subject: [PATCH 3/8] drm/panthor: Fix UAF in panthor_gem_create_with_handle() debugfs code
config: sparc-randconfig-r132-20250529 (https://download.01.org/0day-ci/archive/20250529/202505292016.42gSDa4w-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 8.5.0
reproduce: (https://download.01.org/0day-ci/archive/20250529/202505292016.42gSDa4w-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505292016.42gSDa4w-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/panthor/panthor_gem.c: In function 'panthor_gem_create_object':
>> drivers/gpu/drm/panthor/panthor_gem.c:249:21: error: 'struct panthor_gem_object' has no member named 'debugfs'
     INIT_LIST_HEAD(&obj->debugfs.node);
                        ^~


vim +249 drivers/gpu/drm/panthor/panthor_gem.c

   225	
   226	/**
   227	 * panthor_gem_create_object - Implementation of driver->gem_create_object.
   228	 * @ddev: DRM device
   229	 * @size: Size in bytes of the memory the object will reference
   230	 *
   231	 * This lets the GEM helpers allocate object structs for us, and keep
   232	 * our BO stats correct.
   233	 */
   234	struct drm_gem_object *panthor_gem_create_object(struct drm_device *ddev, size_t size)
   235	{
   236		struct panthor_device *ptdev = container_of(ddev, struct panthor_device, base);
   237		struct panthor_gem_object *obj;
   238	
   239		obj = kzalloc(sizeof(*obj), GFP_KERNEL);
   240		if (!obj)
   241			return ERR_PTR(-ENOMEM);
   242	
   243		obj->base.base.funcs = &panthor_gem_funcs;
   244		obj->base.map_wc = !ptdev->coherent;
   245		mutex_init(&obj->gpuva_list_lock);
   246		drm_gem_gpuva_set_lock(&obj->base.base, &obj->gpuva_list_lock);
   247		mutex_init(&obj->label.lock);
   248	
 > 249		INIT_LIST_HEAD(&obj->debugfs.node);
   250	
   251		return &obj->base.base;
   252	}
   253	
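Editorial note: the error suggests that the debugfs member of struct panthor_gem_object only exists when CONFIG_DEBUG_FS is enabled, presumably off in this randconfig. panthor_gem_create_object(), however, is built unconditionally. One way out is to route every access to that member through a helper that gets a no-op stub in the !CONFIG_DEBUG_FS branch, the same way the patch already handles panthor_gem_debugfs_bo_rm(). The userspace sketch below assumes the real fix takes that shape. All toy_* names are hypothetical.

```c
#include <assert.h>

/* Minimal stand-in for the kernel's struct list_head / INIT_LIST_HEAD(). */
struct toy_list_head {
	struct toy_list_head *next, *prev;
};

static inline void toy_init_list_head(struct toy_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Hypothetical stand-in for struct panthor_gem_object: the debugfs member
 * only exists when CONFIG_DEBUG_FS is defined. */
struct toy_gem_object {
	int placeholder;
#ifdef CONFIG_DEBUG_FS
	struct {
		struct toy_list_head node;
		unsigned int flags;
	} debugfs;
#endif
};

#ifdef CONFIG_DEBUG_FS
static void toy_gem_debugfs_init(struct toy_gem_object *bo)
{
	toy_init_list_head(&bo->debugfs.node);
}
#else
/* No-op stub so the unconditional caller compiles without debugfs. */
static void toy_gem_debugfs_init(struct toy_gem_object *bo)
{
	(void)bo;
}
#endif
```

The create path then calls toy_gem_debugfs_init(obj) unconditionally and never names obj->debugfs directly.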

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH 3/8] drm/panthor: Fix UAF in panthor_gem_create_with_handle() debugfs code
  2025-05-28  9:13 ` [PATCH 3/8] drm/panthor: Fix UAF in panthor_gem_create_with_handle() debugfs code Simona Vetter
  2025-05-29 12:31   ` kernel test robot
@ 2025-06-01 14:06   ` Adrián Larumbe
  2025-06-02 14:46     ` Simona Vetter
  1 sibling, 1 reply; 27+ messages in thread
From: Adrián Larumbe @ 2025-06-01 14:06 UTC (permalink / raw)
  To: Simona Vetter
  Cc: DRI Development, intel-xe, Boris Brezillon, Steven Price,
	Liviu Dudau, Simona Vetter

Hi Simona,

On 28.05.2025 11:13, Simona Vetter wrote:
> The object is potentially already gone after the drm_gem_object_put().
> In general the object should be fully constructed before calling
> drm_gem_handle_create(), except the debugfs tracking uses a separate
> lock and list and separate flag to denote whether the object is
> actually initialized.
>
> Since I'm touching this all anyway simplify this by only adding the
> object to the debugfs when it's ready for that, which allows us to
> delete that separate flag. panthor_gem_debugfs_bo_rm() already checks
> whether we've actually been added to the list or this is some error
> path cleanup.
>
> Fixes: a3707f53eb3f ("drm/panthor: show device-wide list of DRM GEM objects over DebugFS")
> Cc: Adrián Larumbe <adrian.larumbe@collabora.com>
> Cc: Boris Brezillon <boris.brezillon@collabora.com>
> Cc: Steven Price <steven.price@arm.com>
> Cc: Liviu Dudau <liviu.dudau@arm.com>
> Signed-off-by: Simona Vetter <simona.vetter@intel.com>
> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> ---
>  drivers/gpu/drm/panthor/panthor_gem.c | 31 +++++++++++++--------------
>  drivers/gpu/drm/panthor/panthor_gem.h |  3 ---
>  2 files changed, 15 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
> index 7c00fd77758b..f334444cb5df 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.c
> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> @@ -16,9 +16,11 @@
>  #include "panthor_mmu.h"
>
>  #ifdef CONFIG_DEBUG_FS
> -static void panthor_gem_debugfs_bo_add(struct panthor_device *ptdev,
> -				       struct panthor_gem_object *bo)
> +static void panthor_gem_debugfs_bo_add(struct panthor_gem_object *bo)
>  {
> +	struct panthor_device *ptdev = container_of(bo->base.base.dev,
> +						    struct panthor_device, base);
> +
>  	INIT_LIST_HEAD(&bo->debugfs.node);
>
>  	bo->debugfs.creator.tgid = current->group_leader->pid;
> @@ -44,12 +46,10 @@ static void panthor_gem_debugfs_bo_rm(struct panthor_gem_object *bo)
>
>  static void panthor_gem_debugfs_set_usage_flags(struct panthor_gem_object *bo, u32 usage_flags)
>  {
> -	bo->debugfs.flags = usage_flags | PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED;
> +	bo->debugfs.flags = usage_flags;
> +	panthor_gem_debugfs_bo_add(bo);
>  }
>  #else
> -static void panthor_gem_debugfs_bo_add(struct panthor_device *ptdev,
> -				       struct panthor_gem_object *bo)
> -{}
>  static void panthor_gem_debugfs_bo_rm(struct panthor_gem_object *bo) {}
>  static void panthor_gem_debugfs_set_usage_flags(struct panthor_gem_object *bo, u32 usage_flags) {}
>  #endif
> @@ -246,7 +246,7 @@ struct drm_gem_object *panthor_gem_create_object(struct drm_device *ddev, size_t
>  	drm_gem_gpuva_set_lock(&obj->base.base, &obj->gpuva_list_lock);
>  	mutex_init(&obj->label.lock);
>
> -	panthor_gem_debugfs_bo_add(ptdev, obj);
> +	INIT_LIST_HEAD(&obj->debugfs.node);

This is going to break builds with no DebugFS support.

>  	return &obj->base.base;
>  }
> @@ -285,6 +285,12 @@ panthor_gem_create_with_handle(struct drm_file *file,
>  		bo->base.base.resv = bo->exclusive_vm_root_gem->resv;
>  	}
>
> +	/*
> +	 * No explicit flags are needed in the call below, since the
> +	 * function internally sets the INITIALIZED bit for us.
> +	 */

If we got rid of the INITIALIZED usage flag, then this comment should also be reworded.

> +	panthor_gem_debugfs_set_usage_flags(bo, 0);
> +
>  	/*
>  	 * Allocate an id of idr table where the obj is registered
>  	 * and handle has the id what user can see.
> @@ -296,12 +302,6 @@ panthor_gem_create_with_handle(struct drm_file *file,
>  	/* drop reference from allocate - handle holds it now. */
>  	drm_gem_object_put(&shmem->base);
>
> -	/*
> -	 * No explicit flags are needed in the call below, since the
> -	 * function internally sets the INITIALIZED bit for us.
> -	 */
> -	panthor_gem_debugfs_set_usage_flags(bo, 0);
> -
>  	return ret;
>  }
>
> @@ -387,7 +387,7 @@ static void panthor_gem_debugfs_bo_print(struct panthor_gem_object *bo,
>  	unsigned int refcount = kref_read(&bo->base.base.refcount);
>  	char creator_info[32] = {};
>  	size_t resident_size;
> -	u32 gem_usage_flags = bo->debugfs.flags & (u32)~PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED;
> +	u32 gem_usage_flags = bo->debugfs.flags;
>  	u32 gem_state_flags = 0;
>
>  	/* Skip BOs being destroyed. */
> @@ -436,8 +436,7 @@ void panthor_gem_debugfs_print_bos(struct panthor_device *ptdev,
>
>  	scoped_guard(mutex, &ptdev->gems.lock) {
>  		list_for_each_entry(bo, &ptdev->gems.node, debugfs.node) {
> -			if (bo->debugfs.flags & PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED)
> -				panthor_gem_debugfs_bo_print(bo, m, &totals);
> +			panthor_gem_debugfs_bo_print(bo, m, &totals);
>  		}
>  	}
>
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.h b/drivers/gpu/drm/panthor/panthor_gem.h
> index 4dd732dcd59f..8fc7215e9b90 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.h
> +++ b/drivers/gpu/drm/panthor/panthor_gem.h
> @@ -35,9 +35,6 @@ enum panthor_debugfs_gem_usage_flags {
>
>  	/** @PANTHOR_DEBUGFS_GEM_USAGE_FLAG_FW_MAPPED: BO is mapped on the FW VM. */
>  	PANTHOR_DEBUGFS_GEM_USAGE_FLAG_FW_MAPPED = BIT(PANTHOR_DEBUGFS_GEM_USAGE_FW_MAPPED_BIT),
> -
> -	/** @PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED: BO is ready for DebugFS display. */
> -	PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED = BIT(31),
>  };
>
>  /**
> --
> 2.49.0

There's a Panfrost port of the functionality this patch fixes pending merge into drm-misc,
so I should probably ask either Boris or Steven to hold off on merging them till I've made
sure there's no potential UAF in it.

Adrian Larumbe


* Re: [PATCH 4/8] accel/qaic: delete qaic_bo.handle
  2025-05-28 15:15   ` Jeff Hugo
@ 2025-06-02 14:43     ` Simona Vetter
  2025-06-03 14:43       ` Jeff Hugo
  0 siblings, 1 reply; 27+ messages in thread
From: Simona Vetter @ 2025-06-02 14:43 UTC (permalink / raw)
  To: Jeff Hugo
  Cc: Simona Vetter, DRI Development, intel-xe, Carl Vanderlip,
	linux-arm-msm, Simona Vetter

On Wed, May 28, 2025 at 09:15:22AM -0600, Jeff Hugo wrote:
> On 5/28/2025 3:13 AM, Simona Vetter wrote:
> > Handles are per-file, not global, so this makes no sense. Plus it's
> > set only after calling drm_gem_handle_create(), and drivers are not
> > allowed to further intialize a bo after that function has published it
> > already.
> 
> intialize -> initialize
> 
> > It is also entirely unused, which helps enormously with removing it
> > :-)
> 
> There is a downstream reference to it which hasn't quite made it upstream
> yet, but tweaking that should be fine. This is clearly a problem anyways, so
> we'll need to find a solution regardless. Thank you very much for the audit.
> 
> > Since we're still holding a reference to the bo nothing bad can
> > happen, hence not cc: stable material.
> > 
> > Cc: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
> > Cc: Carl Vanderlip <quic_carlv@quicinc.com>
> > Cc: linux-arm-msm@vger.kernel.org
> > Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> > Signed-off-by: Simona Vetter <simona.vetter@intel.com>
> 
> SOB chain seems weird to me. I got this email from @ffwll.ch, which would be
> the author. Where is @intel.com contributing to the handoff of the patch?

I work for intel, so I just whack both of my emails on there for sob
purposes. The intel email tends to be a blackhole for public mail, which
is why I don't use it as From: for anything public.

> Overall, looks good to me. Seems like either I can ack this, and you can
> merge, or I can just take it forward. I have no preference.  Do you?

Whatever you like most, I'll resend the series with the wrong patches
dropped soon anyway.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 3/8] drm/panthor: Fix UAF in panthor_gem_create_with_handle() debugfs code
  2025-06-01 14:06   ` Adrián Larumbe
@ 2025-06-02 14:46     ` Simona Vetter
  0 siblings, 0 replies; 27+ messages in thread
From: Simona Vetter @ 2025-06-02 14:46 UTC (permalink / raw)
  To: Adrián Larumbe
  Cc: Simona Vetter, DRI Development, intel-xe, Boris Brezillon,
	Steven Price, Liviu Dudau, Simona Vetter

On Sun, Jun 01, 2025 at 03:06:15PM +0100, Adrián Larumbe wrote:
> Hi Simona,
> 
> On 28.05.2025 11:13, Simona Vetter wrote:
> > The object is potentially already gone after the drm_gem_object_put().
> > In general the object should be fully constructed before calling
> > drm_gem_handle_create(), except the debugfs tracking uses a separate
> > lock and list and separate flag to denote whether the object is
> > actually initialized.
> >
> > Since I'm touching this all anyway simplify this by only adding the
> > object to the debugfs when it's ready for that, which allows us to
> > delete that separate flag. panthor_gem_debugfs_bo_rm() already checks
> > whether we've actually been added to the list or this is some error
> > path cleanup.
> >
> > Fixes: a3707f53eb3f ("drm/panthor: show device-wide list of DRM GEM objects over DebugFS")
> > Cc: Adrián Larumbe <adrian.larumbe@collabora.com>
> > Cc: Boris Brezillon <boris.brezillon@collabora.com>
> > Cc: Steven Price <steven.price@arm.com>
> > Cc: Liviu Dudau <liviu.dudau@arm.com>
> > Signed-off-by: Simona Vetter <simona.vetter@intel.com>
> > Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> > ---
> >  drivers/gpu/drm/panthor/panthor_gem.c | 31 +++++++++++++--------------
> >  drivers/gpu/drm/panthor/panthor_gem.h |  3 ---
> >  2 files changed, 15 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
> > index 7c00fd77758b..f334444cb5df 100644
> > --- a/drivers/gpu/drm/panthor/panthor_gem.c
> > +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> > @@ -16,9 +16,11 @@
> >  #include "panthor_mmu.h"
> >
> >  #ifdef CONFIG_DEBUG_FS
> > -static void panthor_gem_debugfs_bo_add(struct panthor_device *ptdev,
> > -				       struct panthor_gem_object *bo)
> > +static void panthor_gem_debugfs_bo_add(struct panthor_gem_object *bo)
> >  {
> > +	struct panthor_device *ptdev = container_of(bo->base.base.dev,
> > +						    struct panthor_device, base);
> > +
> >  	INIT_LIST_HEAD(&bo->debugfs.node);
> >
> >  	bo->debugfs.creator.tgid = current->group_leader->pid;
> > @@ -44,12 +46,10 @@ static void panthor_gem_debugfs_bo_rm(struct panthor_gem_object *bo)
> >
> >  static void panthor_gem_debugfs_set_usage_flags(struct panthor_gem_object *bo, u32 usage_flags)
> >  {
> > -	bo->debugfs.flags = usage_flags | PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED;
> > +	bo->debugfs.flags = usage_flags;
> > +	panthor_gem_debugfs_bo_add(bo);
> >  }
> >  #else
> > -static void panthor_gem_debugfs_bo_add(struct panthor_device *ptdev,
> > -				       struct panthor_gem_object *bo)
> > -{}
> >  static void panthor_gem_debugfs_bo_rm(struct panthor_gem_object *bo) {}
> >  static void panthor_gem_debugfs_set_usage_flags(struct panthor_gem_object *bo, u32 usage_flags) {}
> >  #endif
> > @@ -246,7 +246,7 @@ struct drm_gem_object *panthor_gem_create_object(struct drm_device *ddev, size_t
> >  	drm_gem_gpuva_set_lock(&obj->base.base, &obj->gpuva_list_lock);
> >  	mutex_init(&obj->label.lock);
> >
> > -	panthor_gem_debugfs_bo_add(ptdev, obj);
> > +	INIT_LIST_HEAD(&obj->debugfs.node);
> 
> This is going to break builds with no DebugFS support.

Oops, forgot to build-test this. Note that runtime testing would be good,
I don't have the hw for that. Or can some CI pick this up somewhere?

> >  	return &obj->base.base;
> >  }
> > @@ -285,6 +285,12 @@ panthor_gem_create_with_handle(struct drm_file *file,
> >  		bo->base.base.resv = bo->exclusive_vm_root_gem->resv;
> >  	}
> >
> > +	/*
> > +	 * No explicit flags are needed in the call below, since the
> > +	 * function internally sets the INITIALIZED bit for us.
> > +	 */
> 
> If we got rid of the INITIALIZED usage flag, then this comment should also be reworded.

Will also fix this for v2.

> 
> > +	panthor_gem_debugfs_set_usage_flags(bo, 0);
> > +
> >  	/*
> >  	 * Allocate an id of idr table where the obj is registered
> >  	 * and handle has the id what user can see.
> > @@ -296,12 +302,6 @@ panthor_gem_create_with_handle(struct drm_file *file,
> >  	/* drop reference from allocate - handle holds it now. */
> >  	drm_gem_object_put(&shmem->base);
> >
> > -	/*
> > -	 * No explicit flags are needed in the call below, since the
> > -	 * function internally sets the INITIALIZED bit for us.
> > -	 */
> > -	panthor_gem_debugfs_set_usage_flags(bo, 0);
> > -
> >  	return ret;
> >  }
> >
> > @@ -387,7 +387,7 @@ static void panthor_gem_debugfs_bo_print(struct panthor_gem_object *bo,
> >  	unsigned int refcount = kref_read(&bo->base.base.refcount);
> >  	char creator_info[32] = {};
> >  	size_t resident_size;
> > -	u32 gem_usage_flags = bo->debugfs.flags & (u32)~PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED;
> > +	u32 gem_usage_flags = bo->debugfs.flags;
> >  	u32 gem_state_flags = 0;
> >
> >  	/* Skip BOs being destroyed. */
> > @@ -436,8 +436,7 @@ void panthor_gem_debugfs_print_bos(struct panthor_device *ptdev,
> >
> >  	scoped_guard(mutex, &ptdev->gems.lock) {
> >  		list_for_each_entry(bo, &ptdev->gems.node, debugfs.node) {
> > -			if (bo->debugfs.flags & PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED)
> > -				panthor_gem_debugfs_bo_print(bo, m, &totals);
> > +			panthor_gem_debugfs_bo_print(bo, m, &totals);
> >  		}
> >  	}
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_gem.h b/drivers/gpu/drm/panthor/panthor_gem.h
> > index 4dd732dcd59f..8fc7215e9b90 100644
> > --- a/drivers/gpu/drm/panthor/panthor_gem.h
> > +++ b/drivers/gpu/drm/panthor/panthor_gem.h
> > @@ -35,9 +35,6 @@ enum panthor_debugfs_gem_usage_flags {
> >
> >  	/** @PANTHOR_DEBUGFS_GEM_USAGE_FLAG_FW_MAPPED: BO is mapped on the FW VM. */
> >  	PANTHOR_DEBUGFS_GEM_USAGE_FLAG_FW_MAPPED = BIT(PANTHOR_DEBUGFS_GEM_USAGE_FW_MAPPED_BIT),
> > -
> > -	/** @PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED: BO is ready for DebugFS display. */
> > -	PANTHOR_DEBUGFS_GEM_USAGE_FLAG_INITIALIZED = BIT(31),
> >  };
> >
> >  /**
> > --
> > 2.49.0
> 
> There's a Panfrost port of the functionality this patch fixes pending merge into drm-misc,
> so I should probably ask either Boris or Steven to hold off on merging them till I've made
> sure there's no potential UAF in it.

The important part is to do all init before the call to gem_object_put(),
that prevents the UAF. Doing all init before handle_create() is just nice
on top, since that aligns with the core design and avoids the need for a
separate init flag (for which you're at least missing the right memory
barriers here).
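
Editorial note on the barrier remark: if a separate "initialized" flag were kept, the writer would have to set it with release semantics after every other field is written. The reader would have to test it with acquire semantics before touching those fields. A plain store and load give no such ordering. The sketch below uses C11 atomics and hypothetical toy_* names. In kernel code the corresponding primitives would be smp_store_release() and smp_load_acquire().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object whose payload must be complete before readers use it. */
struct toy_bo {
	size_t size;
	unsigned int usage_flags;
	atomic_bool ready;	/* the separate "initialized" flag */
};

/* Writer: fill in every field first, then publish with release semantics,
 * so a reader that observes ready == true also observes these fields. */
static void toy_bo_publish(struct toy_bo *bo, size_t size, unsigned int flags)
{
	bo->size = size;
	bo->usage_flags = flags;
	atomic_store_explicit(&bo->ready, true, memory_order_release);
}

/* Reader: the acquire load pairs with the release store above. Returns
 * false (caller skips the object) if it has not been published yet. */
static bool toy_bo_read(struct toy_bo *bo, size_t *size, unsigned int *flags)
{
	if (!atomic_load_explicit(&bo->ready, memory_order_acquire))
		return false;
	*size = bo->size;
	*flags = bo->usage_flags;
	return true;
}
```

Finishing all initialization before the object goes on the list, as the patch does, makes this pairing unnecessary.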

Thanks for your comments.
-Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail()
  2025-05-28  9:12 ` [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail() Simona Vetter
  2025-05-28  9:26   ` Simona Vetter
  2025-05-28 13:20   ` Jacek Lawrynowicz
@ 2025-06-02 15:15   ` Thomas Zimmermann
  2025-06-03 11:45     ` Simona Vetter
  2 siblings, 1 reply; 27+ messages in thread
From: Thomas Zimmermann @ 2025-06-02 15:15 UTC (permalink / raw)
  To: Simona Vetter, DRI Development
  Cc: intel-xe, Jacek Lawrynowicz, stable, Maarten Lankhorst,
	Maxime Ripard, David Airlie, Simona Vetter, Simona Vetter

Hi

On 28.05.25 at 11:12, Simona Vetter wrote:
> Object creation is a careful dance where we must guarantee that the
> object is fully constructed before it is visible to other threads, and
> GEM buffer objects are no different.
>
> Final publishing happens by calling drm_gem_handle_create(). After
> that the only allowed thing to do is call drm_gem_object_put() because
> a concurrent call to the GEM_CLOSE ioctl with a correctly guessed id
> (which is trivial since we have a linear allocator) can already tear
> down the object again.
>
> Luckily most drivers get this right, the very few exceptions I've
> pinged the relevant maintainers for. Unfortunately we also need
> drm_gem_handle_create() when creating additional handles for an
> already existing object (e.g. GETFB ioctl or the various bo import
> ioctl), and hence we cannot have a drm_gem_handle_create_and_put() as
> the only exported function to stop these issues from happening.
>
> Now unfortunately the implementation of drm_gem_handle_create() isn't
> living up to standards: It does correctly finish object
> initialization at the global level, and hence is safe against a
> concurrent tear down. But it also sets up the file-private aspects of
> the handle, and that part goes wrong: We fully register the object in
> the drm_file.object_idr before calling drm_vma_node_allow() or
> obj->funcs->open, which opens up races against concurrent removal of
> that handle in drm_gem_handle_delete().
>
> Fix this with the usual two-stage approach of first reserving the
> handle id, and then only registering the object after we've completed
> the file-private setup.
>
> Jacek reported this with a testcase of concurrently calling GEM_CLOSE
> on a freshly-created object (which also destroys the object), but it
> should be possible to hit this with just additional handles created
> through import or GETFB without completely destroying the underlying
> object with the concurrent GEM_CLOSE ioctl calls.
>
> Note that the close-side of this race was fixed in f6cd7daecff5 ("drm:
> Release driver references to handle before making it available
> again"), which means a cool 9 years have passed until someone noticed
> that we need to make this symmetric or there are still gaps left :-/
> Without the 2-stage close approach we'd still have a race, therefore
> that's an integral part of this bugfix.
>
> More importantly, this means we can have NULL pointers behind
> allocated id in our drm_file.object_idr. We need to check for that
> now:
>
> - drm_gem_handle_delete() checks for ERR_OR_NULL already
>
> - drm_gem.c:object_lookup() also checks for NULL
>
> - drm_gem_release() should never be called if there's another thread
>    still existing that could call into an IOCTL that creates a new
>    handle, so cannot race. For paranoia I added a NULL check to
>    drm_gem_object_release_handle() though.
>
> - most drivers (etnaviv, i915, msm) are fine because they use
>    idr_find, which maps both ENOENT and NULL to NULL.
>
> - vmwgfx is already broken in vmw_debugfs_gem_info_show() because NULL
>    pointers might exist due to drm_gem_handle_delete(). This needs a
>    separate patch. This is because idr_for_each_entry terminates on the
>    first NULL entry and so might not iterate over everything.
>
> - similar for amd in amdgpu_debugfs_gem_info_show() and
>    amdgpu_gem_force_release(). The latter is really questionable though
>    since it's a best effort hack and there's no way to close all the
>    races. Needs separate patches.
>
> - xe is really broken because it not only uses idr_for_each_entry() but
>    also drops the drm_file.table_lock, which can wreck the idr iterator
>    state if you're unlucky enough. Maybe another reason to look into
>    the drm fdinfo memory stats instead of hand-rolling too much.
>
> - drm_show_memory_stats() is also broken since it uses
>    idr_for_each_entry. But since that's a preexisting bug I'll follow
>    up with a separate patch.
>
> Reported-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> Cc: stable@vger.kernel.org
> Cc: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Signed-off-by: Simona Vetter <simona.vetter@intel.com>
> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> ---
>   drivers/gpu/drm/drm_gem.c | 10 +++++++++-
>   include/drm/drm_file.h    |  3 +++
>   2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index 1e659d2660f7..e4e20dda47b1 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -279,6 +279,9 @@ drm_gem_object_release_handle(int id, void *ptr, void *data)
>   	struct drm_file *file_priv = data;
>   	struct drm_gem_object *obj = ptr;
>   
> +	if (WARN_ON(!data))
> +		return 0;
> +
>   	if (obj->funcs->close)
>   		obj->funcs->close(obj, file_priv);
>   
> @@ -399,7 +402,7 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
>   	idr_preload(GFP_KERNEL);
>   	spin_lock(&file_priv->table_lock);
>   
> -	ret = idr_alloc(&file_priv->object_idr, obj, 1, 0, GFP_NOWAIT);
> +	ret = idr_alloc(&file_priv->object_idr, NULL, 1, 0, GFP_NOWAIT);
>   
>   	spin_unlock(&file_priv->table_lock);
>   	idr_preload_end();
> @@ -420,6 +423,11 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
>   			goto err_revoke;
>   	}
>   
> +	/* mirrors drm_gem_handle_delete to avoid races */
> +	spin_lock(&file_priv->table_lock);
> +	obj = idr_replace(&file_priv->object_idr, obj, handle);
> +	WARN_ON(obj != NULL);

A DRM print function would be preferable. The obj here is an errno 
pointer. Should the errno code be part of the error message?

If it fails, why does the function still succeed?

Best regards
Thomas

> +	spin_unlock(&file_priv->table_lock);
>   	*handlep = handle;
>   	return 0;
>   
> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> index 5c3b2aa3e69d..d344d41e6cfe 100644
> --- a/include/drm/drm_file.h
> +++ b/include/drm/drm_file.h
> @@ -300,6 +300,9 @@ struct drm_file {
>   	 *
>   	 * Mapping of mm object handles to object pointers. Used by the GEM
>   	 * subsystem. Protected by @table_lock.
> +	 *
> +	 * Note that allocated entries might be NULL as a transient state when
> +	 * creating or deleting a handle.
>   	 */
>   	struct idr object_idr;
>   

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstrasse 146, 90461 Nuernberg, Germany
GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman
HRB 36809 (AG Nuernberg)
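
Editorial note: the two-stage scheme from the commit message above works in two steps. First, reserve the handle id with a NULL entry. Then finish the file-private setup and only afterwards swap the object in. The model below is a hedged userspace sketch. The table, the toy_* names and the omitted locking are hypothetical simplifications of drm_file.object_idr. In the patch, both stages run under file_priv->table_lock, via idr_alloc() and idr_replace() respectively. The model is single-threaded and only demonstrates the ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_HANDLES 8

struct toy_obj {
	bool file_private_setup_done;	/* models drm_vma_node_allow() + ->open */
};

/* Hypothetical per-file handle table modelling drm_file.object_idr. */
struct toy_handle_table {
	bool allocated[TOY_MAX_HANDLES];
	struct toy_obj *entry[TOY_MAX_HANDLES];
};

/* Stage 1: reserve an id (starting at 1, like the idr_alloc() call in the
 * patch) but store NULL, so lookups cannot reach the object yet. */
static int toy_handle_reserve(struct toy_handle_table *t)
{
	for (int id = 1; id < TOY_MAX_HANDLES; id++) {
		if (!t->allocated[id]) {
			t->allocated[id] = true;
			t->entry[id] = NULL;
			return id;
		}
	}
	return -1;	/* table full */
}

/* A reserved-but-NULL slot looks exactly like a missing handle. */
static struct toy_obj *toy_handle_lookup(struct toy_handle_table *t, int id)
{
	if (id <= 0 || id >= TOY_MAX_HANDLES || !t->allocated[id])
		return NULL;
	return t->entry[id];
}

/* Stage 2: only after the file-private setup is complete, bind the object
 * to the reserved id (the role idr_replace() plays in the patch). */
static void toy_handle_publish(struct toy_handle_table *t, int id,
			       struct toy_obj *obj)
{
	assert(obj->file_private_setup_done);
	assert(t->allocated[id] && t->entry[id] == NULL);
	t->entry[id] = obj;
}
```

A concurrent GEM_CLOSE or lookup that races with stage 1 therefore sees either nothing or a fully set-up handle, never a half-initialized one.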



* Re: [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail()
  2025-06-02 15:15   ` Thomas Zimmermann
@ 2025-06-03 11:45     ` Simona Vetter
  2025-06-03 12:40       ` Thomas Zimmermann
  2025-06-04  9:02       ` Simona Vetter
  0 siblings, 2 replies; 27+ messages in thread
From: Simona Vetter @ 2025-06-03 11:45 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Simona Vetter, DRI Development, intel-xe, Jacek Lawrynowicz,
	stable, Maarten Lankhorst, Maxime Ripard, David Airlie,
	Simona Vetter, Simona Vetter

On Mon, Jun 02, 2025 at 05:15:58PM +0200, Thomas Zimmermann wrote:
> Hi
> 
> On 28.05.25 at 11:12, Simona Vetter wrote:
> > Object creation is a careful dance where we must guarantee that the
> > object is fully constructed before it is visible to other threads, and
> > GEM buffer objects are no different.
> > 
> > Final publishing happens by calling drm_gem_handle_create(). After
> > that the only allowed thing to do is call drm_gem_object_put() because
> > a concurrent call to the GEM_CLOSE ioctl with a correctly guessed id
> > (which is trivial since we have a linear allocator) can already tear
> > down the object again.
> > 
> > Luckily most drivers get this right, the very few exceptions I've
> > pinged the relevant maintainers for. Unfortunately we also need
> > drm_gem_handle_create() when creating additional handles for an
> > already existing object (e.g. GETFB ioctl or the various bo import
> > ioctl), and hence we cannot have a drm_gem_handle_create_and_put() as
> > the only exported function to stop these issues from happening.
> > 
> > Now unfortunately the implementation of drm_gem_handle_create() isn't
> > living up to standards: It does correctly finish object
> > initialization at the global level, and hence is safe against a
> > concurrent tear down. But it also sets up the file-private aspects of
> > the handle, and that part goes wrong: We fully register the object in
> > the drm_file.object_idr before calling drm_vma_node_allow() or
> > obj->funcs->open, which opens up races against concurrent removal of
> > that handle in drm_gem_handle_delete().
> > 
> > Fix this with the usual two-stage approach of first reserving the
> > handle id, and then only registering the object after we've completed
> > the file-private setup.
> > 
> > Jacek reported this with a testcase of concurrently calling GEM_CLOSE
> > on a freshly-created object (which also destroys the object), but it
> > should be possible to hit this with just additional handles created
> > through import or GETFB without completely destroying the underlying
> > object with the concurrent GEM_CLOSE ioctl calls.
> > 
> > Note that the close-side of this race was fixed in f6cd7daecff5 ("drm:
> > Release driver references to handle before making it available
> > again"), which means a cool 9 years have passed until someone noticed
> > that we need to make this symmetric or there are still gaps left :-/
> > Without the 2-stage close approach we'd still have a race, therefore
> > that's an integral part of this bugfix.
> > 
> > More importantly, this means we can have NULL pointers behind
> > allocated id in our drm_file.object_idr. We need to check for that
> > now:
> > 
> > - drm_gem_handle_delete() checks for ERR_OR_NULL already
> > 
> > - drm_gem.c:object_lookup() also checks for NULL
> > 
> > - drm_gem_release() should never be called if there's another thread
> >    still existing that could call into an IOCTL that creates a new
> >    handle, so cannot race. For paranoia I added a NULL check to
> >    drm_gem_object_release_handle() though.
> > 
> > - most drivers (etnaviv, i915, msm) are fine because they use
> >    idr_find, which maps both ENOENT and NULL to NULL.
> > 
> > - vmwgfx is already broken in vmw_debugfs_gem_info_show() because NULL
> >    pointers might exist due to drm_gem_handle_delete(). This needs a
> >    separate patch. This is because idr_for_each_entry terminates on the
> >    first NULL entry and so might not iterate over everything.
> > 
> > - similar for amd in amdgpu_debugfs_gem_info_show() and
> >    amdgpu_gem_force_release(). The latter is really questionable though
> >    since it's a best effort hack and there's no way to close all the
> >    races. Needs separate patches.
> > 
> > - xe is really broken because it not only uses idr_for_each_entry() but
> >    also drops the drm_file.table_lock, which can wreck the idr iterator
> >    state if you're unlucky enough. Maybe another reason to look into
> >    the drm fdinfo memory stats instead of hand-rolling too much.
> > 
> > - drm_show_memory_stats() is also broken since it uses
> >    idr_for_each_entry. But since that's a preexisting bug I'll follow
> >    up with a separate patch.
> > 
> > Reported-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> > Cc: stable@vger.kernel.org
> > Cc: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > Cc: Maxime Ripard <mripard@kernel.org>
> > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > Cc: David Airlie <airlied@gmail.com>
> > Cc: Simona Vetter <simona@ffwll.ch>
> > Signed-off-by: Simona Vetter <simona.vetter@intel.com>
> > Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> > ---
> >   drivers/gpu/drm/drm_gem.c | 10 +++++++++-
> >   include/drm/drm_file.h    |  3 +++
> >   2 files changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> > index 1e659d2660f7..e4e20dda47b1 100644
> > --- a/drivers/gpu/drm/drm_gem.c
> > +++ b/drivers/gpu/drm/drm_gem.c
> > @@ -279,6 +279,9 @@ drm_gem_object_release_handle(int id, void *ptr, void *data)
> >   	struct drm_file *file_priv = data;
> >   	struct drm_gem_object *obj = ptr;
> > +	if (WARN_ON(!data))
> > +		return 0;
> > +
> >   	if (obj->funcs->close)
> >   		obj->funcs->close(obj, file_priv);
> > @@ -399,7 +402,7 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
> >   	idr_preload(GFP_KERNEL);
> >   	spin_lock(&file_priv->table_lock);
> > -	ret = idr_alloc(&file_priv->object_idr, obj, 1, 0, GFP_NOWAIT);
> > +	ret = idr_alloc(&file_priv->object_idr, NULL, 1, 0, GFP_NOWAIT);
> >   	spin_unlock(&file_priv->table_lock);
> >   	idr_preload_end();
> > @@ -420,6 +423,11 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
> >   			goto err_revoke;
> >   	}
> > +	/* mirrors drm_gem_handle_delete to avoid races */
> > +	spin_lock(&file_priv->table_lock);
> > +	obj = idr_replace(&file_priv->object_idr, obj, handle);
> > +	WARN_ON(obj != NULL);
> 
> A DRM print function would be preferable. The obj here is an errno pointer.
> Should the errno code be part of the error message?
> 
> If it fails, why does the function still succeed?

This is an internal error that should never happen, at that point just
bailing out is the way to go.

Also note that the error code here is just to satisfy the function
signature that idr_for_each expects, we don't look at it ever (since if
there's no bugs, it should never fail). I learned this because I actually
removed the int return value and stuff didn't compile :-)

I can use drm_WARN_ON if you want me to though?

I'll also explain this in the commit message for the next round.
-Sima

> 
> Best regards
> Thomas
> 
> > +	spin_unlock(&file_priv->table_lock);
> >   	*handlep = handle;
> >   	return 0;
> > diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> > index 5c3b2aa3e69d..d344d41e6cfe 100644
> > --- a/include/drm/drm_file.h
> > +++ b/include/drm/drm_file.h
> > @@ -300,6 +300,9 @@ struct drm_file {
> >   	 *
> >   	 * Mapping of mm object handles to object pointers. Used by the GEM
> >   	 * subsystem. Protected by @table_lock.
> > +	 *
> > +	 * Note that allocated entries might be NULL as a transient state when
> > +	 * creating or deleting a handle.
> >   	 */
> >   	struct idr object_idr;
> 
> -- 
> --
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Software Solutions Germany GmbH
> Frankenstrasse 146, 90461 Nuernberg, Germany
> GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman
> HRB 36809 (AG Nuernberg)
> 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

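The reserve-then-publish pattern discussed in this patch can be modelled outside the kernel. What follows is a minimal userspace sketch under stated assumptions, not the kernel code: struct handle_table, handle_reserve(), handle_publish(), handle_lookup() and handle_close() are hypothetical names, the table is a fixed array rather than an IDR, and a C11 atomic_flag spin loop stands in for drm_file.table_lock. It illustrates the two properties the fix relies on: a reserved id holds NULL until all file-private setup is done, so a lookup or a racing close treats it as absent; and close unpublishes the entry before tearing down, freeing the id only at the very end.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_HANDLES 16 /* hypothetical fixed table size */

struct obj { int refcount; }; /* hypothetical GEM object stand-in */

/* Hypothetical model of drm_file.object_idr: used[id] marks an allocated
 * id, whose slots[id] may still be NULL mid-create or mid-delete. */
struct handle_table {
	atomic_flag lock; /* models drm_file.table_lock */
	int used[NR_HANDLES];
	struct obj *slots[NR_HANDLES];
};

static void table_init(struct handle_table *t)
{
	for (int i = 0; i < NR_HANDLES; i++) {
		t->used[i] = 0;
		t->slots[i] = NULL;
	}
	atomic_flag_clear(&t->lock);
}

static void table_lock(struct handle_table *t)
{
	while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
		; /* spin */
}

static void table_unlock(struct handle_table *t)
{
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* Create, stage 1: reserve the lowest free id >= 1 with a NULL entry. */
static int handle_reserve(struct handle_table *t)
{
	int ret = -1;
	table_lock(t);
	for (int id = 1; id < NR_HANDLES; id++) {
		if (!t->used[id]) {
			t->used[id] = 1;
			t->slots[id] = NULL;
			ret = id;
			break;
		}
	}
	table_unlock(t);
	return ret;
}

/* Create, stage 2: publish only once file-private setup has finished. */
static void handle_publish(struct handle_table *t, int id, struct obj *o)
{
	table_lock(t);
	assert(t->used[id] && t->slots[id] == NULL);
	t->slots[id] = o;
	table_unlock(t);
}

/* Lookup treats a reserved-but-NULL id exactly like a free one. */
static struct obj *handle_lookup(struct handle_table *t, int id)
{
	struct obj *o = NULL;
	if (id <= 0 || id >= NR_HANDLES)
		return NULL;
	table_lock(t);
	if (t->used[id])
		o = t->slots[id];
	table_unlock(t);
	return o;
}

/* Close: unpublish under the lock, tear down unlocked, then free the id. */
static int handle_close(struct handle_table *t, int id)
{
	struct obj *o = NULL;
	if (id <= 0 || id >= NR_HANDLES)
		return -1;
	table_lock(t);
	if (t->used[id]) {
		o = t->slots[id];
		t->slots[id] = NULL;
	}
	table_unlock(t);
	if (!o)
		return -1; /* free, or reserved but not yet published */

	o->refcount--; /* drop the handle's reference outside the lock */

	table_lock(t);
	t->used[id] = 0;
	table_unlock(t);
	return 0;
}
```

In this model a close that races with creation simply fails, much as drm_gem_handle_delete() rejects a NULL entry, and the id becomes reusable only after the final unlock in handle_close().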

* Re: [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail()
  2025-06-03 11:45     ` Simona Vetter
@ 2025-06-03 12:40       ` Thomas Zimmermann
  2025-06-04  9:02       ` Simona Vetter
  1 sibling, 0 replies; 27+ messages in thread
From: Thomas Zimmermann @ 2025-06-03 12:40 UTC (permalink / raw)
  To: Simona Vetter
  Cc: DRI Development, intel-xe, Jacek Lawrynowicz, stable,
	Maarten Lankhorst, Maxime Ripard, David Airlie, Simona Vetter,
	Simona Vetter

Hi

Am 03.06.25 um 13:45 schrieb Simona Vetter:
> On Mon, Jun 02, 2025 at 05:15:58PM +0200, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 28.05.25 um 11:12 schrieb Simona Vetter:
>>> Object creation is a careful dance where we must guarantee that the
>>> object is fully constructed before it is visible to other threads, and
>>> GEM buffer objects are no different.
>>>
>>> Final publishing happens by calling drm_gem_handle_create(). After
>>> that the only allowed thing to do is call drm_gem_object_put() because
>>> a concurrent call to the GEM_CLOSE ioctl with a correctly guessed id
>>> (which is trivial since we have a linear allocator) can already tear
>>> down the object again.
>>>
>>> Luckily most drivers get this right, the very few exceptions I've
>>> pinged the relevant maintainers for. Unfortunately we also need
>>> drm_gem_handle_create() when creating additional handles for an
>>> already existing object (e.g. GETFB ioctl or the various bo import
>>> ioctls), and hence we cannot have a drm_gem_handle_create_and_put() as
>>> the only exported function to stop these issues from happening.
>>>
>>> Now unfortunately the implementation of drm_gem_handle_create() isn't
>>> living up to standards: It does correctly finish object
>>> initialization at the global level, and hence is safe against a
>>> concurrent tear down. But it also sets up the file-private aspects of
>>> the handle, and that part goes wrong: We fully register the object in
>>> the drm_file.object_idr before calling drm_vma_node_allow() or
>>> obj->funcs->open, which opens up races against concurrent removal of
>>> that handle in drm_gem_handle_delete().
>>>
>>> Fix this with the usual two-stage approach of first reserving the
>>> handle id, and then only registering the object after we've completed
>>> the file-private setup.
>>>
>>> Jacek reported this with a testcase of concurrently calling GEM_CLOSE
>>> on a freshly-created object (which also destroys the object), but it
>>> should be possible to hit this with just additional handles created
>>> through import or GETFB without completely destroying the underlying
>>> object with the concurrent GEM_CLOSE ioctl calls.
>>>
>>> Note that the close-side of this race was fixed in f6cd7daecff5 ("drm:
>>> Release driver references to handle before making it available
>>> again"), which means a cool 9 years have passed until someone noticed
>>> that we need to make this symmetric or there are still gaps left :-/
>>> Without the 2-stage close approach we'd still have a race, therefore
>>> that's an integral part of this bugfix.
>>>
>>> More importantly, this means we can have NULL pointers behind
>>> allocated ids in our drm_file.object_idr. We need to check for that
>>> now:
>>>
>>> - drm_gem_handle_delete() checks for ERR_OR_NULL already
>>>
>>> - drm_gem.c:object_lookup() also checks for NULL
>>>
>>> - drm_gem_release() should never be called if there's another thread
>>>     still existing that could call into an IOCTL that creates a new
>>>     handle, so cannot race. For paranoia I added a NULL check to
>>>     drm_gem_object_release_handle() though.
>>>
>>> - most drivers (etnaviv, i915, msm) are fine because they use
>>>     idr_find, which maps both ENOENT and NULL to NULL.
>>>
>>> - vmwgfx is already broken in vmw_debugfs_gem_info_show() because NULL
>>>     pointers might exist due to drm_gem_handle_delete(). This needs a
>>>     separate patch. This is because idr_for_each_entry terminates on the
>>>     first NULL entry and so might not iterate over everything.
>>>
>>> - similar for amd in amdgpu_debugfs_gem_info_show() and
>>>     amdgpu_gem_force_release(). The latter is really questionable though
>>>     since it's a best effort hack and there's no way to close all the
>>>     races. Needs separate patches.
>>>
>>> - xe is really broken because it not only uses idr_for_each_entry() but
>>>     also drops the drm_file.table_lock, which can wreck the idr iterator
>>>     state if you're unlucky enough. Maybe another reason to look into
>>>     the drm fdinfo memory stats instead of hand-rolling too much.
>>>
>>> - drm_show_memory_stats() is also broken since it uses
>>>     idr_for_each_entry. But since that's a preexisting bug I'll follow
>>>     up with a separate patch.
>>>
>>> Reported-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>>> Cc: stable@vger.kernel.org
>>> Cc: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
>>> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
>>> Cc: Maxime Ripard <mripard@kernel.org>
>>> Cc: Thomas Zimmermann <tzimmermann@suse.de>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Signed-off-by: Simona Vetter <simona.vetter@intel.com>
>>> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
>>> ---
>>>    drivers/gpu/drm/drm_gem.c | 10 +++++++++-
>>>    include/drm/drm_file.h    |  3 +++
>>>    2 files changed, 12 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
>>> index 1e659d2660f7..e4e20dda47b1 100644
>>> --- a/drivers/gpu/drm/drm_gem.c
>>> +++ b/drivers/gpu/drm/drm_gem.c
>>> @@ -279,6 +279,9 @@ drm_gem_object_release_handle(int id, void *ptr, void *data)
>>>    	struct drm_file *file_priv = data;
>>>    	struct drm_gem_object *obj = ptr;
>>> +	if (WARN_ON(!data))
>>> +		return 0;
>>> +
>>>    	if (obj->funcs->close)
>>>    		obj->funcs->close(obj, file_priv);
>>> @@ -399,7 +402,7 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
>>>    	idr_preload(GFP_KERNEL);
>>>    	spin_lock(&file_priv->table_lock);
>>> -	ret = idr_alloc(&file_priv->object_idr, obj, 1, 0, GFP_NOWAIT);
>>> +	ret = idr_alloc(&file_priv->object_idr, NULL, 1, 0, GFP_NOWAIT);
>>>    	spin_unlock(&file_priv->table_lock);
>>>    	idr_preload_end();
>>> @@ -420,6 +423,11 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
>>>    			goto err_revoke;
>>>    	}
>>> +	/* mirrors drm_gem_handle_delete to avoid races */
>>> +	spin_lock(&file_priv->table_lock);
>>> +	obj = idr_replace(&file_priv->object_idr, obj, handle);
>>> +	WARN_ON(obj != NULL);
>> A DRM print function would be preferable. The obj here is an errno pointer.
>> Should the errno code be part of the error message?
>>
>> If it fails, why does the function still succeed?
> This is an internal error that should never happen, at that point just
> bailing out is the way to go.
>
> Also note that the error code here is just to satisfy the function
> signature that idr_for_each expects, we don't look at it ever (since if
> there's no bugs, it should never fail). I learned this because I actually
> removed the int return value and stuff didn't compile :-)

I see.

>
> I can use drm_WARN_ON if you want me to though?

If you use drm_WARN_ON, you can add

Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>

Best regards
Thomas

>
> I'll also explain this in the commit message for the next round.
> -Sima
>
>> Best regards
>> Thomas
>>
>>> +	spin_unlock(&file_priv->table_lock);
>>>    	*handlep = handle;
>>>    	return 0;
>>> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
>>> index 5c3b2aa3e69d..d344d41e6cfe 100644
>>> --- a/include/drm/drm_file.h
>>> +++ b/include/drm/drm_file.h
>>> @@ -300,6 +300,9 @@ struct drm_file {
>>>    	 *
>>>    	 * Mapping of mm object handles to object pointers. Used by the GEM
>>>    	 * subsystem. Protected by @table_lock.
>>> +	 *
>>> +	 * Note that allocated entries might be NULL as a transient state when
>>> +	 * creating or deleting a handle.
>>>    	 */
>>>    	struct idr object_idr;

-- 
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Frankenstrasse 146, 90461 Nuernberg, Germany
GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman
HRB 36809 (AG Nuernberg)



* Re: [PATCH 4/8] accel/qaic: delete qaic_bo.handle
  2025-06-02 14:43     ` Simona Vetter
@ 2025-06-03 14:43       ` Jeff Hugo
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Hugo @ 2025-06-03 14:43 UTC (permalink / raw)
  To: Simona Vetter
  Cc: DRI Development, intel-xe, Carl Vanderlip, linux-arm-msm,
	Simona Vetter

On 6/2/2025 8:43 AM, Simona Vetter wrote:
> On Wed, May 28, 2025 at 09:15:22AM -0600, Jeff Hugo wrote:
>> On 5/28/2025 3:13 AM, Simona Vetter wrote:
>>> Handles are per-file, not global, so this makes no sense. Plus it's
>>> set only after calling drm_gem_handle_create(), and drivers are not
>>> allowed to further intialize a bo after that function has published it
>>> already.
>>
>> intialize -> initialize
>>
>>> It is also entirely unused, which helps enormously with removing it
>>> :-)
>>
>> There is a downstream reference to it which hasn't quite made it upstream
>> yet, but tweaking that should be fine. This is clearly a problem anyways, so
>> we'll need to find a solution regardless. Thank you very much for the audit.
>>
>>> Since we're still holding a reference to the bo nothing bad can
>>> happen, hence not cc: stable material.
>>>
>>> Cc: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
>>> Cc: Carl Vanderlip <quic_carlv@quicinc.com>
>>> Cc: linux-arm-msm@vger.kernel.org
>>> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
>>> Signed-off-by: Simona Vetter <simona.vetter@intel.com>
>>
>> SOB chain seems weird to me. I got this email from @ffwll.ch, which would be
>> the author. Where is @intel.com contributing to the handoff of the patch?
> 
> I work for Intel, so I just whack both of my emails on there for SoB
> purposes. The Intel email tends to be a black hole for public mail, which
> is why I don't use it as From: for anything public.
>
>> Overall, looks good to me. Seems like either I can ack this, and you can
>> merge, or I can just take it forward. I have no preference.  Do you?
> 
> Whatever you like most, I'll resend the series with the wrong patches
> dropped soon anyway.

I think I'll apply it this week, with the typo fixed. Then I can 
mentally check it off my list as closed.

Reviewed-by: Jeff Hugo <jeff.hugo@oss.qualcomm.com>

-Jeff



* Re: [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail()
  2025-06-03 11:45     ` Simona Vetter
  2025-06-03 12:40       ` Thomas Zimmermann
@ 2025-06-04  9:02       ` Simona Vetter
  1 sibling, 0 replies; 27+ messages in thread
From: Simona Vetter @ 2025-06-04  9:02 UTC (permalink / raw)
  To: Thomas Zimmermann
  Cc: Simona Vetter, DRI Development, intel-xe, Jacek Lawrynowicz,
	stable, Maarten Lankhorst, Maxime Ripard, David Airlie,
	Simona Vetter, Simona Vetter

On Tue, Jun 03, 2025 at 01:45:54PM +0200, Simona Vetter wrote:
> On Mon, Jun 02, 2025 at 05:15:58PM +0200, Thomas Zimmermann wrote:
> > Hi
> > 
> > Am 28.05.25 um 11:12 schrieb Simona Vetter:
> > > Object creation is a careful dance where we must guarantee that the
> > > object is fully constructed before it is visible to other threads, and
> > > GEM buffer objects are no different.
> > > 
> > > Final publishing happens by calling drm_gem_handle_create(). After
> > > that the only allowed thing to do is call drm_gem_object_put() because
> > > a concurrent call to the GEM_CLOSE ioctl with a correctly guessed id
> > > (which is trivial since we have a linear allocator) can already tear
> > > down the object again.
> > > 
> > > Luckily most drivers get this right, the very few exceptions I've
> > > pinged the relevant maintainers for. Unfortunately we also need
> > > drm_gem_handle_create() when creating additional handles for an
> > > already existing object (e.g. GETFB ioctl or the various bo import
> > > ioctls), and hence we cannot have a drm_gem_handle_create_and_put() as
> > > the only exported function to stop these issues from happening.
> > > 
> > > Now unfortunately the implementation of drm_gem_handle_create() isn't
> > > living up to standards: It does correctly finish object
> > > initialization at the global level, and hence is safe against a
> > > concurrent tear down. But it also sets up the file-private aspects of
> > > the handle, and that part goes wrong: We fully register the object in
> > > the drm_file.object_idr before calling drm_vma_node_allow() or
> > > obj->funcs->open, which opens up races against concurrent removal of
> > > that handle in drm_gem_handle_delete().
> > > 
> > > Fix this with the usual two-stage approach of first reserving the
> > > handle id, and then only registering the object after we've completed
> > > the file-private setup.
> > > 
> > > Jacek reported this with a testcase of concurrently calling GEM_CLOSE
> > > on a freshly-created object (which also destroys the object), but it
> > > should be possible to hit this with just additional handles created
> > > through import or GETFB without completely destroying the underlying
> > > object with the concurrent GEM_CLOSE ioctl calls.
> > > 
> > > Note that the close-side of this race was fixed in f6cd7daecff5 ("drm:
> > > Release driver references to handle before making it available
> > > again"), which means a cool 9 years have passed until someone noticed
> > > that we need to make this symmetric or there are still gaps left :-/
> > > Without the 2-stage close approach we'd still have a race, therefore
> > > that's an integral part of this bugfix.
> > > 
> > > More importantly, this means we can have NULL pointers behind
> > > allocated ids in our drm_file.object_idr. We need to check for that
> > > now:
> > > 
> > > - drm_gem_handle_delete() checks for ERR_OR_NULL already
> > > 
> > > - drm_gem.c:object_lookup() also checks for NULL
> > > 
> > > - drm_gem_release() should never be called if there's another thread
> > >    still existing that could call into an IOCTL that creates a new
> > >    handle, so cannot race. For paranoia I added a NULL check to
> > >    drm_gem_object_release_handle() though.
> > > 
> > > - most drivers (etnaviv, i915, msm) are fine because they use
> > >    idr_find, which maps both ENOENT and NULL to NULL.
> > > 
> > > - vmwgfx is already broken in vmw_debugfs_gem_info_show() because NULL
> > >    pointers might exist due to drm_gem_handle_delete(). This needs a
> > >    separate patch. This is because idr_for_each_entry terminates on the
> > >    first NULL entry and so might not iterate over everything.
> > > 
> > > - similar for amd in amdgpu_debugfs_gem_info_show() and
> > >    amdgpu_gem_force_release(). The latter is really questionable though
> > >    since it's a best effort hack and there's no way to close all the
> > >    races. Needs separate patches.
> > > 
> > > - xe is really broken because it not only uses idr_for_each_entry() but
> > >    also drops the drm_file.table_lock, which can wreck the idr iterator
> > >    state if you're unlucky enough. Maybe another reason to look into
> > >    the drm fdinfo memory stats instead of hand-rolling too much.
> > > 
> > > - drm_show_memory_stats() is also broken since it uses
> > >    idr_for_each_entry. But since that's a preexisting bug I'll follow
> > >    up with a separate patch.
> > > 
> > > Reported-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> > > Cc: stable@vger.kernel.org
> > > Cc: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com>
> > > Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> > > Cc: Maxime Ripard <mripard@kernel.org>
> > > Cc: Thomas Zimmermann <tzimmermann@suse.de>
> > > Cc: David Airlie <airlied@gmail.com>
> > > Cc: Simona Vetter <simona@ffwll.ch>
> > > Signed-off-by: Simona Vetter <simona.vetter@intel.com>
> > > Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> > > ---
> > >   drivers/gpu/drm/drm_gem.c | 10 +++++++++-
> > >   include/drm/drm_file.h    |  3 +++
> > >   2 files changed, 12 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> > > index 1e659d2660f7..e4e20dda47b1 100644
> > > --- a/drivers/gpu/drm/drm_gem.c
> > > +++ b/drivers/gpu/drm/drm_gem.c
> > > @@ -279,6 +279,9 @@ drm_gem_object_release_handle(int id, void *ptr, void *data)
> > >   	struct drm_file *file_priv = data;
> > >   	struct drm_gem_object *obj = ptr;
> > > +	if (WARN_ON(!data))
> > > +		return 0;
> > > +
> > >   	if (obj->funcs->close)
> > >   		obj->funcs->close(obj, file_priv);
> > > @@ -399,7 +402,7 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
> > >   	idr_preload(GFP_KERNEL);
> > >   	spin_lock(&file_priv->table_lock);
> > > -	ret = idr_alloc(&file_priv->object_idr, obj, 1, 0, GFP_NOWAIT);
> > > +	ret = idr_alloc(&file_priv->object_idr, NULL, 1, 0, GFP_NOWAIT);
> > >   	spin_unlock(&file_priv->table_lock);
> > >   	idr_preload_end();
> > > @@ -420,6 +423,11 @@ drm_gem_handle_create_tail(struct drm_file *file_priv,
> > >   			goto err_revoke;
> > >   	}
> > > +	/* mirrors drm_gem_handle_delete to avoid races */
> > > +	spin_lock(&file_priv->table_lock);
> > > +	obj = idr_replace(&file_priv->object_idr, obj, handle);
> > > +	WARN_ON(obj != NULL);
> > 
> > A DRM print function would be preferable. The obj here is an errno pointer.
> > Should the errno code be part of the error message?
> > 
> > If it fails, why does the function still succeed?
> 
> This is an internal error that should never happen, at that point just
> bailing out is the way to go.
> 
> Also note that the error code here is just to satisfy the function
> signature that idr_for_each expects, we don't look at it ever (since if
> there's no bugs, it should never fail). I learned this because I actually
> removed the int return value and stuff didn't compile :-)

Ok this part was nonsense, I mixed it up with handle_delete(). I still
don't think we should return an error code here, because we've
successfully installed the handle. It's just that something happened with
the idr that should be impossible, so all bets are off.
-Sima

> I can use drm_WARN_ON if you want me to though?
> 
> I'll also explain this in the commit message for the next round.
> -Sima
> 
> > 
> > Best regards
> > Thomas
> > 
> > > +	spin_unlock(&file_priv->table_lock);
> > >   	*handlep = handle;
> > >   	return 0;
> > > diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> > > index 5c3b2aa3e69d..d344d41e6cfe 100644
> > > --- a/include/drm/drm_file.h
> > > +++ b/include/drm/drm_file.h
> > > @@ -300,6 +300,9 @@ struct drm_file {
> > >   	 *
> > >   	 * Mapping of mm object handles to object pointers. Used by the GEM
> > >   	 * subsystem. Protected by @table_lock.
> > > +	 *
> > > +	 * Note that allocated entries might be NULL as a transient state when
> > > +	 * creating or deleting a handle.
> > >   	 */
> > >   	struct idr object_idr;
> > 

-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

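The review question of why the function still succeeds when the final replace finds something unexpected can be illustrated with a reduced sketch. It is a hypothetical model, not drm_gem.c: slot_replace() mimics the "store the new entry, return the old one" contract of idr_replace() (which in the kernel can also return an ERR_PTR, the errno pointer mentioned in the review), locking is left out for brevity, and fprintf() stands in for drm_WARN_ON(). Because the id was reserved with NULL, any non-NULL previous entry points to a bug elsewhere; the new handle is installed regardless, so the sketch reports the anomaly and still returns success.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NR_SLOTS 8 /* hypothetical table size */

/* Hypothetical, simplified table; locking omitted for brevity. */
struct table {
	void *slots[NR_SLOTS];
	int warnings; /* count of "impossible" states seen, for illustration */
};

/* Mimics the idr_replace() contract: store ptr, return the old entry. */
static void *slot_replace(struct table *t, int id, void *ptr)
{
	void *old;

	assert(id > 0 && id < NR_SLOTS);
	old = t->slots[id];
	t->slots[id] = ptr;
	return old;
}

/*
 * Final publish step for an id that was reserved with NULL. A non-NULL
 * old entry means a bug elsewhere, but the new handle is installed
 * regardless, so report the anomaly (stand-in for drm_WARN_ON()) and
 * still return success to the caller.
 */
static int handle_publish_checked(struct table *t, int id, void *obj,
				  unsigned int *handlep)
{
	void *old = slot_replace(t, id, obj);

	if (old != NULL) {
		t->warnings++;
		fprintf(stderr, "WARNING: handle %d already held %p\n", id, old);
	}
	*handlep = (unsigned int)id;
	return 0;
}
```

Returning an error here would hand userspace a failure code for a handle that is in fact live, which is the point of the reply above.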

* Re: [PATCH 4/8] accel/qaic: delete qaic_bo.handle
  2025-05-28  9:13 ` [PATCH 4/8] accel/qaic: delete qaic_bo.handle Simona Vetter
  2025-05-28 15:15   ` Jeff Hugo
@ 2025-06-06 16:25   ` Jeff Hugo
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff Hugo @ 2025-06-06 16:25 UTC (permalink / raw)
  To: Simona Vetter, DRI Development
  Cc: intel-xe, Carl Vanderlip, linux-arm-msm, Simona Vetter

On 5/28/2025 3:13 AM, Simona Vetter wrote:
> Handles are per-file, not global, so this makes no sense. Plus it's
> set only after calling drm_gem_handle_create(), and drivers are not
> allowed to further intialize a bo after that function has published it
> already.
> 
> It is also entirely unused, which helps enormously with removing it
> :-)
> 
> Since we're still holding a reference to the bo nothing bad can
> happen, hence not cc: stable material.
> 
> Cc: Jeff Hugo <jeff.hugo@oss.qualcomm.com>
> Cc: Carl Vanderlip <quic_carlv@quicinc.com>
> Cc: linux-arm-msm@vger.kernel.org
> Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
> Signed-off-by: Simona Vetter <simona.vetter@intel.com>

Pushed to drm-misc-next

-Jeff


end of thread

Thread overview: 27+ messages
2025-05-28  9:12 [PATCH 0/8] drm/gem: Audit around handle_create races Simona Vetter
2025-05-28  9:12 ` [PATCH 1/8] drm/gem: Fix race in drm_gem_handle_create_tail() Simona Vetter
2025-05-28  9:26   ` Simona Vetter
2025-05-28 13:20   ` Jacek Lawrynowicz
2025-06-02 15:15   ` Thomas Zimmermann
2025-06-03 11:45     ` Simona Vetter
2025-06-03 12:40       ` Thomas Zimmermann
2025-06-04  9:02       ` Simona Vetter
2025-05-28  9:13 ` [PATCH 2/8] drm/fdinfo: Switch to idr_for_each() in drm_show_memory_stats() Simona Vetter
2025-05-28  9:22   ` Simona Vetter
2025-05-28 20:10   ` kernel test robot
2025-05-28  9:13 ` [PATCH 3/8] drm/panthor: Fix UAF in panthor_gem_create_with_handle() debugfs code Simona Vetter
2025-05-29 12:31   ` kernel test robot
2025-06-01 14:06   ` Adrián Larumbe
2025-06-02 14:46     ` Simona Vetter
2025-05-28  9:13 ` [PATCH 4/8] accel/qaic: delete qaic_bo.handle Simona Vetter
2025-05-28 15:15   ` Jeff Hugo
2025-06-02 14:43     ` Simona Vetter
2025-06-03 14:43       ` Jeff Hugo
2025-06-06 16:25   ` Jeff Hugo
2025-05-28  9:13 ` [PATCH 5/8] drm/amd/kfd: Add comment about possible drm_gem_handle_create() race Simona Vetter
2025-05-28  9:13 ` [PATCH 6/8] drm/amdgpu: Add comments about drm_file.object_idr issues Simona Vetter
2025-05-28  9:22   ` Simona Vetter
2025-05-28  9:13 ` [PATCH 7/8] drm/vmwgfx: " Simona Vetter
2025-05-28  9:23   ` Simona Vetter
2025-05-28  9:13 ` [PATCH 8/8] drm/xe: " Simona Vetter
2025-05-28  9:24   ` Simona Vetter
