public inbox for intel-xe@lists.freedesktop.org
 help / color / mirror / Atom feed
* [RFC PATCH V7 00/10] Add memory page offlining support
@ 2026-04-16  7:49 Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 01/10] drm/xe: Link VRAM object with gpu buddy Tejas Upadhyay
                   ` (11 more replies)
  0 siblings, 12 replies; 19+ messages in thread
From: Tejas Upadhyay @ 2026-04-16  7:49 UTC (permalink / raw)
  To: intel-xe
  Cc: matthew.auld, matthew.brost, thomas.hellstrom,
	himal.prasad.ghimiray, Tejas Upadhyay

This functionality represents a significant step in making
the xe driver gracefully handle hardware memory degradation.
By integrating with the DRM Buddy allocator, the driver
can permanently "carve out" faulty memory so it isn't reused
by subsequent allocations.

This series adds memory page offlining support with the following:
1. Link VRAM object with gpu buddy
2. Integrate lockdep for gpu buddy manager
3. Link and track ttm BOs with physical addresses
4. Link LRC BO and its execution Queue
5. Extend BO purge to handle vram pages as well
6. Handle a reported physical address error by reserving the address's 4K page
7. Add supporting debugfs to automate injection of physical address errors
8. Add buddy block allocation dump for debugging buddy related issues
9. Add configfs for vram bad page reservation policy
10. Add sysfs entry to provide statistics of bad gpu vram pages for user info

v7:
- Improve debugfs warning messages
- Use scope_guard for locking (MattB)
- Adapt addition of queue member of LRC BO (MattB)
- Extend and use xe_ttm_bo_purge API for vram pages (MattB)
- Handle dma_buf_map requests for native and remote (MattB)
- If the block was never initialized, set block to NULL
- Add lockdep in gpu buddy (MattB)
- Correct allocated_addr_to_block logic (MattA)
V6:
- Add more specific tests to noncritical bo sections
- Handle smooth exit of user created exec queues
- Break code and make purge specific static API
V5:
- Sysfs "max_pages" addition
- Reset block->private NULL post purge
- Remove wedge, return -EIO so the system controller will initiate reset
- Add debugfs tests to trigger different test scenarios manually and via igt
- Rename addr_to_tbo to addr_to_block and move under gpu/buddy.c
V4: API reworks, add configfs for policy reservation and apply config everywhere
V3: use res_to_mem_region to avoid use of block->private (MattA)
V2:
- some fixes and clean up on errors
- Added xe_vram_addr_to_region helper to avoid other use of block->private (MattB)

The debugfs interface allows testing the different scenarios:
echo 0 > /sys/kernel/debug/dri/bdf/invalid_addr_vram0
where 0 is one of the address types below:
enum mempage_offline_mode {
        MEMPAGE_OFFLINE_UNALLOCATED = 0,
        MEMPAGE_OFFLINE_USER_ALLOCATED = 1,
        MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED = 2,
        MEMPAGE_OFFLINE_KERNEL_USER_PPGTT_ALLOCATED = 3,
        MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED = 4,
        MEMPAGE_OFFLINE_RESERVED = 5,
};
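
Each mode can be exercised in turn with a small shell loop; this is only a usage sketch, and the debugfs node path shown is illustrative (substitute the device's actual PCI BDF):

```shell
#!/bin/sh
# Sketch: write each offline mode into the fault-injection debugfs node.
# "bdf" in the path is a placeholder for the device's PCI BDF.
inject_all_modes() {
	node="$1"	# e.g. /sys/kernel/debug/dri/bdf/invalid_addr_vram0
	for mode in 0 1 2 3 4 5; do
		echo "$mode" > "$node" || return 1
	done
}
```

In practice each mode would be injected one at a time, checking dmesg for the expected offlining messages between writes.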

IGT tests for testing this feature:
https://patchwork.freedesktop.org/patch/714751/

Results of above tests:
Using IGT_SRANDOM=1774610050 for randomisation
Opened device: /dev/dri/card0
Starting subtest: unallocated
Subtest unallocated: SUCCESS (1.834s)
Starting subtest: user-allocated
Subtest user-allocated: SUCCESS (1.832s)
Starting subtest: user-ggtt-allocated
Subtest user-ggtt-allocated: SUCCESS (1.871s)
Starting subtest: user-ppgtt-allocated
Subtest user-ppgtt-allocated: SUCCESS (1.843s)
Starting subtest: critical-allocated
Subtest critical-allocated: SUCCESS (1.824s)
Starting subtest: reserved
Subtest reserved: SUCCESS (0.032s)


Tejas Upadhyay (10):
  drm/xe: Link VRAM object with gpu buddy
  gpu/buddy: Integrate lockdep for gpu buddy manager
  drm/gpu: Add gpu_buddy_allocated_addr_to_block helper
  drm/xe: Link LRC BO and its execution Queue
  drm/xe: Extend BO purge to handle vram pages as well
  drm/xe: Handle physical memory address error
  drm/xe/cri: Add debugfs to inject faulty vram address
  gpu/buddy: Add routine to dump allocated buddy blocks
  drm/xe/configfs: Add vram bad page reservation policy
  drm/xe/cri: Add sysfs interface for bad gpu vram pages

 drivers/gpu/buddy.c                        | 114 ++++++-
 drivers/gpu/drm/drm_buddy.c                |   7 +-
 drivers/gpu/drm/xe/xe_bo.c                 |  16 +-
 drivers/gpu/drm/xe/xe_bo.h                 |   5 +-
 drivers/gpu/drm/xe/xe_bo_types.h           |   3 +
 drivers/gpu/drm/xe/xe_configfs.c           |  64 +++-
 drivers/gpu/drm/xe/xe_configfs.h           |   2 +
 drivers/gpu/drm/xe/xe_debugfs.c            | 171 ++++++++++
 drivers/gpu/drm/xe/xe_device_sysfs.c       |   7 +
 drivers/gpu/drm/xe/xe_dma_buf.c            |   3 +
 drivers/gpu/drm/xe/xe_exec_queue.c         |  10 +-
 drivers/gpu/drm/xe/xe_pt.c                 |   3 +-
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c       | 371 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.h       |   2 +
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |  32 ++
 include/linux/gpu_buddy.h                  |  44 +++
 16 files changed, 839 insertions(+), 15 deletions(-)

-- 
2.52.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH V7 01/10] drm/xe: Link VRAM object with gpu buddy
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
@ 2026-04-16  7:49 ` Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu buddy manager Tejas Upadhyay
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tejas Upadhyay @ 2026-04-16  7:49 UTC (permalink / raw)
  To: intel-xe
  Cc: matthew.auld, matthew.brost, thomas.hellstrom,
	himal.prasad.ghimiray, Tejas Upadhyay

Set up a link from the gpu buddy block back to its TTM buffer
object. This functionality is critical for supporting the memory
page offline feature on CRI, where identified faulty pages must be
traced back to their originating buffer for safe removal.

V2(MattB): Clear block->private in xe_ttm_vram_mgr_del as well

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 5fd0d5506a7e..01a9b92772f8 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -54,6 +54,7 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
 	struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
 	struct xe_ttm_vram_mgr_resource *vres;
 	struct gpu_buddy *mm = &mgr->mm;
+	struct gpu_buddy_block *block;
 	u64 size, min_page_size;
 	unsigned long lpfn;
 	int err;
@@ -138,6 +139,8 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
 	}
 
 	mgr->visible_avail -= vres->used_visible_size;
+	list_for_each_entry(block, &vres->blocks, link)
+		block->private = tbo;
 	mutex_unlock(&mgr->lock);
 
 	if (!(vres->base.placement & TTM_PL_FLAG_CONTIGUOUS) &&
@@ -176,8 +179,11 @@ static void xe_ttm_vram_mgr_del(struct ttm_resource_manager *man,
 		to_xe_ttm_vram_mgr_resource(res);
 	struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
 	struct gpu_buddy *mm = &mgr->mm;
+	struct gpu_buddy_block *block;
 
 	mutex_lock(&mgr->lock);
+	list_for_each_entry(block, &vres->blocks, link)
+		block->private = NULL;
 	gpu_buddy_free_list(mm, &vres->blocks, 0);
 	mgr->visible_avail += vres->used_visible_size;
 	mutex_unlock(&mgr->lock);
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu buddy manager
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 01/10] drm/xe: Link VRAM object with gpu buddy Tejas Upadhyay
@ 2026-04-16  7:49 ` Tejas Upadhyay
  2026-04-16  8:55   ` Matthew Auld
  2026-04-16  7:49 ` [RFC PATCH V7 03/10] drm/gpu: Add gpu_buddy_allocated_addr_to_block helper Tejas Upadhyay
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Tejas Upadhyay @ 2026-04-16  7:49 UTC (permalink / raw)
  To: intel-xe
  Cc: matthew.auld, matthew.brost, thomas.hellstrom,
	himal.prasad.ghimiray, Tejas Upadhyay

Integrate lockdep into the gpu_buddy manager, as is standard practice,
to verify that internal resources are correctly protected by their
associated locks.

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/buddy.c                  | 18 ++++++++++--
 drivers/gpu/drm/drm_buddy.c          |  7 +++--
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  3 ++
 include/linux/gpu_buddy.h            | 41 ++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c
index 52686672e99f..53ff85ac2105 100644
--- a/drivers/gpu/buddy.c
+++ b/drivers/gpu/buddy.c
@@ -437,6 +437,9 @@ int gpu_buddy_init(struct gpu_buddy *mm, u64 size, u64 chunk_size)
 		root_count++;
 	} while (size);
 
+#ifdef CONFIG_LOCKDEP
+	mm->lock_dep_map = NULL;
+#endif
 	return 0;
 
 out_free_roots:
@@ -464,6 +467,7 @@ void gpu_buddy_fini(struct gpu_buddy *mm)
 	unsigned int order;
 	int i;
 
+	gpu_buddy_driver_lock_held(mm);
 	size = mm->size;
 
 	for (i = 0; i < mm->n_roots; ++i) {
@@ -538,6 +542,7 @@ void gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear)
 	unsigned int order;
 	int i;
 
+	gpu_buddy_driver_lock_held(mm);
 	size = mm->size;
 	for (i = 0; i < mm->n_roots; ++i) {
 		order = ilog2(size) - ilog2(mm->chunk_size);
@@ -580,6 +585,7 @@ EXPORT_SYMBOL(gpu_buddy_reset_clear);
 void gpu_buddy_free_block(struct gpu_buddy *mm,
 			  struct gpu_buddy_block *block)
 {
+	gpu_buddy_driver_lock_held(mm);
 	BUG_ON(!gpu_buddy_block_is_allocated(block));
 	mm->avail += gpu_buddy_block_size(mm, block);
 	if (gpu_buddy_block_is_clear(block))
@@ -633,6 +639,7 @@ void gpu_buddy_free_list(struct gpu_buddy *mm,
 {
 	bool mark_clear = flags & GPU_BUDDY_CLEARED;
 
+	gpu_buddy_driver_lock_held(mm);
 	__gpu_buddy_free_list(mm, objects, mark_clear, !mark_clear);
 }
 EXPORT_SYMBOL(gpu_buddy_free_list);
@@ -1172,6 +1179,8 @@ int gpu_buddy_block_trim(struct gpu_buddy *mm,
 	u64 new_start;
 	int err;
 
+	gpu_buddy_driver_lock_held(mm);
+
 	if (!list_is_singular(blocks))
 		return -EINVAL;
 
@@ -1287,6 +1296,8 @@ int gpu_buddy_alloc_blocks(struct gpu_buddy *mm,
 	unsigned long pages;
 	int err;
 
+	gpu_buddy_driver_lock_held(mm);
+
 	if (size < mm->chunk_size)
 		return -EINVAL;
 
@@ -1458,9 +1469,11 @@ EXPORT_SYMBOL(gpu_buddy_alloc_blocks);
 void gpu_buddy_block_print(struct gpu_buddy *mm,
 			   struct gpu_buddy_block *block)
 {
-	u64 start = gpu_buddy_block_offset(block);
-	u64 size = gpu_buddy_block_size(mm, block);
+	u64 start, size;
 
+	gpu_buddy_driver_lock_held(mm);
+	start = gpu_buddy_block_offset(block);
+	size = gpu_buddy_block_size(mm, block);
 	pr_info("%#018llx-%#018llx: %llu\n", start, start + size, size);
 }
 EXPORT_SYMBOL(gpu_buddy_block_print);
@@ -1475,6 +1488,7 @@ void gpu_buddy_print(struct gpu_buddy *mm)
 {
 	int order;
 
+	gpu_buddy_driver_lock_held(mm);
 	pr_info("chunk_size: %lluKiB, total: %lluMiB, free: %lluMiB, clear_free: %lluMiB\n",
 		mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20, mm->clear_avail >> 20);
 
diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
index 841f3de5f307..f4ad09b8a36e 100644
--- a/drivers/gpu/drm/drm_buddy.c
+++ b/drivers/gpu/drm/drm_buddy.c
@@ -25,9 +25,11 @@ void drm_buddy_block_print(struct gpu_buddy *mm,
 			   struct gpu_buddy_block *block,
 			   struct drm_printer *p)
 {
-	u64 start = gpu_buddy_block_offset(block);
-	u64 size = gpu_buddy_block_size(mm, block);
+	u64 start, size;
 
+	gpu_buddy_driver_lock_held(mm);
+	start = gpu_buddy_block_offset(block);
+	size = gpu_buddy_block_size(mm, block);
 	drm_printf(p, "%#018llx-%#018llx: %llu\n", start, start + size, size);
 }
 EXPORT_SYMBOL(drm_buddy_block_print);
@@ -42,6 +44,7 @@ void drm_buddy_print(struct gpu_buddy *mm, struct drm_printer *p)
 {
 	int order;
 
+	gpu_buddy_driver_lock_held(mm);
 	drm_printf(p, "chunk_size: %lluKiB, total: %lluMiB, free: %lluMiB, clear_free: %lluMiB\n",
 		   mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20, mm->clear_avail >> 20);
 
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 01a9b92772f8..935e589dd4b0 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -293,7 +293,9 @@ static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
 
 	WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
 
+	mutex_lock(&mgr->lock);
 	gpu_buddy_fini(&mgr->mm);
+	mutex_unlock(&mgr->lock);
 
 	ttm_resource_manager_cleanup(&mgr->manager);
 
@@ -328,6 +330,7 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
 	if (err)
 		return err;
 
+	gpu_buddy_driver_set_lock(&mgr->mm, &mgr->lock);
 	ttm_set_driver_manager(&xe->ttm, mem_type, &mgr->manager);
 	ttm_resource_manager_set_used(&mgr->manager, true);
 
diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
index 5fa917ba5450..c174de80ad72 100644
--- a/include/linux/gpu_buddy.h
+++ b/include/linux/gpu_buddy.h
@@ -154,6 +154,7 @@ struct gpu_buddy_block {
  * @avail: Total free space currently available for allocation in bytes.
  * @clear_avail: Free space available in the clear tree (zeroed memory) in bytes.
  *               This is a subset of @avail.
+ * @lock_dep_map: Annotates gpu_buddy API with a driver provided lock.
  */
 struct gpu_buddy {
 /* private: */
@@ -179,8 +180,48 @@ struct gpu_buddy {
 	u64 size;
 	u64 avail;
 	u64 clear_avail;
+#ifdef CONFIG_LOCKDEP
+	struct lockdep_map *lock_dep_map;
+#endif
 };
 
+#ifdef CONFIG_LOCKDEP
+/**
+ * gpu_buddy_driver_set_lock() - Set the lock protecting accesses to GPU BUDDY
+ * @mm: Pointer to GPU buddy structure.
+ * @lock: the lock used to protect the gpu buddy. The locking primitive
+ * must contain a dep_map field.
+ *
+ * Call this to annotate gpu_buddy APIs which access/modify gpu_buddy manager
+ */
+#define gpu_buddy_driver_set_lock(mm, lock) \
+	do { \
+		struct gpu_buddy *__mm = (mm); \
+		if (!WARN(__mm->lock_dep_map, "GPU BUDDY MM lock should be set only once.")) \
+			__mm->lock_dep_map = &(lock)->dep_map; \
+	} while (0)
+#else
+#define gpu_buddy_driver_set_lock(mm, lock) do { (void)(mm); (void)(lock); } while (0)
+#endif
+
+#ifdef CONFIG_LOCKDEP
+/**
+ * gpu_buddy_driver_lock_held() - Assert GPU BUDDY manager lock is held
+ * @mm: Pointer to the GPU BUDDY structure.
+ *
+ * Ensure driver lock is held.
+ */
+static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm)
+{
+	if ((mm)->lock_dep_map)
+		lockdep_assert(lock_is_held_type((mm)->lock_dep_map, 0));
+}
+#else
+static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm)
+{
+}
+#endif
+
 static inline u64
 gpu_buddy_block_offset(const struct gpu_buddy_block *block)
 {
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V7 03/10] drm/gpu: Add gpu_buddy_allocated_addr_to_block helper
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 01/10] drm/xe: Link VRAM object with gpu buddy Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu buddy manager Tejas Upadhyay
@ 2026-04-16  7:49 ` Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 04/10] drm/xe: Link LRC BO and its execution Queue Tejas Upadhyay
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tejas Upadhyay @ 2026-04-16  7:49 UTC (permalink / raw)
  To: intel-xe
  Cc: matthew.auld, matthew.brost, thomas.hellstrom,
	himal.prasad.ghimiray, Tejas Upadhyay

Add a helper whose primary purpose is to efficiently trace a specific
physical memory address back to its corresponding TTM buffer object.

v2:
- %s/gpu_buddy_addr_to_block/gpu_buddy_allocated_addr_to_block(MattA)
- remove clear->avail and split nodes check(MattA)
- Adapt lockdep(MattB)

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/buddy.c       | 53 +++++++++++++++++++++++++++++++++++++++
 include/linux/gpu_buddy.h |  2 ++
 2 files changed, 55 insertions(+)

diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c
index 53ff85ac2105..44bf926aca56 100644
--- a/drivers/gpu/buddy.c
+++ b/drivers/gpu/buddy.c
@@ -595,6 +595,59 @@ void gpu_buddy_free_block(struct gpu_buddy *mm,
 }
 EXPORT_SYMBOL(gpu_buddy_free_block);
 
+/**
+ * gpu_buddy_allocated_addr_to_block - given relative address find the allocated block
+ *
+ * @mm: GPU buddy manager
+ * @addr: Relative address
+ *
+ * Returns:
+ * gpu_buddy_block on success, NULL or error code on failure
+ */
+struct gpu_buddy_block *gpu_buddy_allocated_addr_to_block(struct gpu_buddy *mm, u64 addr)
+{
+	struct gpu_buddy_block *block;
+	LIST_HEAD(dfs);
+	u64 end;
+	int i;
+
+	gpu_buddy_driver_lock_held(mm);
+
+	end = addr + SZ_4K - 1;
+	for (i = 0; i < mm->n_roots; ++i)
+		list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+	do {
+		u64 block_start;
+		u64 block_end;
+
+		block = list_first_entry_or_null(&dfs,
+						 struct gpu_buddy_block,
+						 tmp_link);
+		if (!block)
+			break;
+
+		list_del(&block->tmp_link);
+
+		block_start = gpu_buddy_block_offset(block);
+		block_end = block_start + gpu_buddy_block_size(mm, block) - 1;
+
+		if (!overlaps(addr, end, block_start, block_end))
+			continue;
+
+		if (gpu_buddy_block_is_allocated(block))
+			return block;
+		else if (gpu_buddy_block_is_free(block))
+			return NULL;
+
+		list_add(&block->right->tmp_link, &dfs);
+		list_add(&block->left->tmp_link, &dfs);
+	} while (1);
+
+	return ERR_PTR(-ENXIO);
+}
+EXPORT_SYMBOL(gpu_buddy_allocated_addr_to_block);
+
 static void __gpu_buddy_free_list(struct gpu_buddy *mm,
 				  struct list_head *objects,
 				  bool mark_clear,
diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
index c174de80ad72..e42c38a2a9f5 100644
--- a/include/linux/gpu_buddy.h
+++ b/include/linux/gpu_buddy.h
@@ -272,6 +272,8 @@ void gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear);
 
 void gpu_buddy_free_block(struct gpu_buddy *mm, struct gpu_buddy_block *block);
 
+struct gpu_buddy_block *gpu_buddy_allocated_addr_to_block(struct gpu_buddy *mm, u64 addr);
+
 void gpu_buddy_free_list(struct gpu_buddy *mm,
 			 struct list_head *objects,
 			 unsigned int flags);
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V7 04/10] drm/xe: Link LRC BO and its execution Queue
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
                   ` (2 preceding siblings ...)
  2026-04-16  7:49 ` [RFC PATCH V7 03/10] drm/gpu: Add gpu_buddy_allocated_addr_to_block helper Tejas Upadhyay
@ 2026-04-16  7:49 ` Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 05/10] drm/xe: Extend BO purge to handle vram pages as well Tejas Upadhyay
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tejas Upadhyay @ 2026-04-16  7:49 UTC (permalink / raw)
  To: intel-xe
  Cc: matthew.auld, matthew.brost, thomas.hellstrom,
	himal.prasad.ghimiray, Tejas Upadhyay

Establish a link between an LRC BO (Logical Ring Context
Buffer Object) and its corresponding execution queue in the
drm/xe driver by storing a back-pointer to the queue within
the BO's private data structure. This allows the driver to
identify and take corrective action on the specific queue if
the LRC BO encounters an error (e.g., memory corruption or
eviction issues).

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_bo_types.h   | 3 +++
 drivers/gpu/drm/xe/xe_exec_queue.c | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_bo_types.h b/drivers/gpu/drm/xe/xe_bo_types.h
index 9d19940b8fc0..a647b775a65c 100644
--- a/drivers/gpu/drm/xe/xe_bo_types.h
+++ b/drivers/gpu/drm/xe/xe_bo_types.h
@@ -20,6 +20,7 @@
 struct xe_device;
 struct xe_mem_pool_node;
 struct xe_vm;
+struct xe_exec_queue;
 
 #define XE_BO_MAX_PLACEMENTS	3
 
@@ -40,6 +41,8 @@ struct xe_bo {
 	u32 flags;
 	/** @vm: VM this BO is attached to, for extobj this will be NULL */
 	struct xe_vm *vm;
+	/** @q: Queue this BO is attached to, mostly for LRC BO, NULL otherwise */
+	struct xe_exec_queue *q;
 	/** @tile: Tile this BO is attached to (kernel BO only) */
 	struct xe_tile *tile;
 	/** @placements: valid placements for this BO */
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 071b8c41df43..632c9603afc1 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -386,6 +386,7 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q, u32 exec_queue_flags)
 				goto err_lrc;
 			}
 
+			lrc->bo->q = q;
 			xe_exec_queue_set_lrc(q, lrc, i);
 
 			if (__lrc)
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V7 05/10] drm/xe: Extend BO purge to handle vram pages as well
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
                   ` (3 preceding siblings ...)
  2026-04-16  7:49 ` [RFC PATCH V7 04/10] drm/xe: Link LRC BO and its execution Queue Tejas Upadhyay
@ 2026-04-16  7:49 ` Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 06/10] drm/xe: Handle physical memory address error Tejas Upadhyay
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tejas Upadhyay @ 2026-04-16  7:49 UTC (permalink / raw)
  To: intel-xe
  Cc: matthew.auld, matthew.brost, thomas.hellstrom,
	himal.prasad.ghimiray, Tejas Upadhyay

Extend the purgeable buffer object (BO) API so that purging also
handles VRAM pages, to better manage memory pressure and enable
memory offlining.

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c | 5 +----
 drivers/gpu/drm/xe/xe_bo.h | 1 +
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 5ce60d161e09..04d3b25c7c8e 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -903,7 +903,7 @@ void xe_bo_set_purgeable_state(struct xe_bo *bo,
  *
  * Return: 0 on success, negative error code on failure
  */
-static int xe_ttm_bo_purge(struct ttm_buffer_object *ttm_bo, struct ttm_operation_ctx *ctx)
+int xe_ttm_bo_purge(struct ttm_buffer_object *ttm_bo, struct ttm_operation_ctx *ctx)
 {
 	struct xe_bo *bo = ttm_to_xe_bo(ttm_bo);
 	struct ttm_placement place = {};
@@ -911,9 +911,6 @@ static int xe_ttm_bo_purge(struct ttm_buffer_object *ttm_bo, struct ttm_operatio
 
 	xe_bo_assert_held(bo);
 
-	if (!ttm_bo->ttm)
-		return 0;
-
 	if (!xe_bo_madv_is_dontneed(bo))
 		return 0;
 
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index 68dea7d25a6b..9f55b3589caf 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -500,6 +500,7 @@ struct xe_bo_shrink_flags {
 long xe_bo_shrink(struct ttm_operation_ctx *ctx, struct ttm_buffer_object *bo,
 		  const struct xe_bo_shrink_flags flags,
 		  unsigned long *scanned);
+int xe_ttm_bo_purge(struct ttm_buffer_object *ttm_bo, struct ttm_operation_ctx *ctx);
 
 /**
  * xe_bo_is_mem_type - Whether the bo currently resides in the given
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V7 06/10] drm/xe: Handle physical memory address error
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
                   ` (4 preceding siblings ...)
  2026-04-16  7:49 ` [RFC PATCH V7 05/10] drm/xe: Extend BO purge to handle vram pages as well Tejas Upadhyay
@ 2026-04-16  7:49 ` Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 07/10] drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tejas Upadhyay @ 2026-04-16  7:49 UTC (permalink / raw)
  To: intel-xe
  Cc: matthew.auld, matthew.brost, thomas.hellstrom,
	himal.prasad.ghimiray, Tejas Upadhyay

This functionality represents a significant step in making
the xe driver gracefully handle hardware memory degradation.
By integrating with the DRM Buddy allocator, the driver
can permanently "carve out" faulty memory so it isn't reused
by subsequent allocations.

Buddy Block Reservation:
------------------------
When a memory address is reported as faulty, the driver instructs
the DRM Buddy allocator to reserve a block of the specific page
size (typically 4KB). This marks the memory as "dirty/used"
indefinitely.

Two-Stage Tracking:
-------------------
Offlined Pages:
Pages that have been successfully isolated and removed from the
available memory pool.

Queued Pages:
Addresses that have been flagged as faulty but are currently in
use by a process. These are tracked until the associated buffer
object (BO) is released or migrated, at which point they move
to the "offlined" state.
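
The queued-to-offlined transition described above can be modeled as a small state machine; a minimal userspace C sketch follows (all names here are illustrative, not the driver's actual types):

```c
#include <assert.h>

/* Illustrative model of the two-stage tracking: a faulty page is
 * QUEUED while its owning BO is still live, and becomes OFFLINED
 * once that BO is released or migrated. */
enum page_state { PAGE_QUEUED, PAGE_OFFLINED };

struct bad_page {
	unsigned long long addr;	/* faulty physical address */
	enum page_state state;
};

/* Called when the owning BO goes away: the page leaves the "in use"
 * pool and is permanently carved out of the allocator. */
static void bad_page_bo_released(struct bad_page *p)
{
	if (p->state == PAGE_QUEUED)
		p->state = PAGE_OFFLINED;
}
```

In the series itself the two states correspond to the mgr->queued_pages and mgr->offlined_pages lists kept under the VRAM manager lock.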

v7:
- keep vm ref during vm kill and fix some typos
- FW communication code is moved in RAS, keep comment for same
V6:
- Use scope_guard for locking (MattB)
- Adapt addition of queue member of LRC BO (MattB)
- Extend and use xe_ttm_bo_purge API for vram pages (MattB)
- Handle dma_buf_map requests for native and remote (MattB)
- If the block was never initialized, set block to NULL
V5:
- Categorise and handle BOs accordingly
- Fix crash found with new debugfs tests
V4:
- Set block->private NULL post bo purge
- Filter out gsm address early on
- Rebase
V3:
- Rename API, remove tile dependency and add status of reservation
V2:
- Fix mm->avail counter issue
- Remove unused code and handle clean up in case of error

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c                 |  11 +-
 drivers/gpu/drm/xe/xe_bo.h                 |   4 +-
 drivers/gpu/drm/xe/xe_dma_buf.c            |   3 +
 drivers/gpu/drm/xe/xe_exec_queue.c         |   9 +-
 drivers/gpu/drm/xe/xe_pt.c                 |   3 +-
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c       | 273 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.h       |   1 +
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |  28 +++
 8 files changed, 326 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 04d3b25c7c8e..275e20a7e733 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -158,7 +158,16 @@ bool xe_bo_is_vm_bound(struct xe_bo *bo)
 	return !list_empty(&bo->ttm.base.gpuva.list);
 }
 
-static bool xe_bo_is_user(struct xe_bo *bo)
+/**
+ * xe_bo_is_user - check if BO is user created BO
+ * @bo: The BO
+ *
+ * Check if BO is a user created BO. This requires the
+ * reservation lock for the BO to be held.
+ *
+ * Returns: boolean
+ */
+bool xe_bo_is_user(struct xe_bo *bo)
 {
 	return bo->flags & XE_BO_FLAG_USER;
 }
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index 9f55b3589caf..073fae905073 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -277,7 +277,8 @@ static inline void xe_bo_unpin_map_no_vm(struct xe_bo *bo)
 {
 	if (likely(bo)) {
 		xe_bo_lock(bo, false);
-		xe_bo_unpin(bo);
+		if (!xe_bo_is_purged(bo))
+			xe_bo_unpin(bo);
 		xe_bo_unlock(bo);
 
 		xe_bo_put(bo);
@@ -501,6 +502,7 @@ long xe_bo_shrink(struct ttm_operation_ctx *ctx, struct ttm_buffer_object *bo,
 		  const struct xe_bo_shrink_flags flags,
 		  unsigned long *scanned);
 int xe_ttm_bo_purge(struct ttm_buffer_object *ttm_bo, struct ttm_operation_ctx *ctx);
+bool xe_bo_is_user(struct xe_bo *bo);
 
 /**
  * xe_bo_is_mem_type - Whether the bo currently resides in the given
diff --git a/drivers/gpu/drm/xe/xe_dma_buf.c b/drivers/gpu/drm/xe/xe_dma_buf.c
index b9828da15897..21bf152f387d 100644
--- a/drivers/gpu/drm/xe/xe_dma_buf.c
+++ b/drivers/gpu/drm/xe/xe_dma_buf.c
@@ -104,6 +104,9 @@ static struct sg_table *xe_dma_buf_map(struct dma_buf_attachment *attach,
 	struct sg_table *sgt;
 	int r = 0;
 
+	if (xe_bo_is_purged(bo))
+		return ERR_PTR(-ENOENT);
+
 	if (!attach->peer2peer && !xe_bo_can_migrate(bo, XE_PL_TT))
 		return ERR_PTR(-EOPNOTSUPP);
 
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 632c9603afc1..7b8eb2c01634 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -385,7 +385,6 @@ static int __xe_exec_queue_init(struct xe_exec_queue *q, u32 exec_queue_flags)
 				err = PTR_ERR(lrc);
 				goto err_lrc;
 			}
-
 			lrc->bo->q = q;
 			xe_exec_queue_set_lrc(q, lrc, i);
 
@@ -1555,8 +1554,12 @@ void xe_exec_queue_update_run_ticks(struct xe_exec_queue *q)
 	 * errors.
 	 */
 	lrc = q->lrc[0];
-	new_ts = xe_lrc_update_timestamp(lrc, &old_ts);
-	q->xef->run_ticks[q->class] += (new_ts - old_ts) * q->width;
+	xe_bo_lock(lrc->bo, false);
+	if (!xe_bo_is_purged(lrc->bo)) {
+		new_ts = xe_lrc_update_timestamp(lrc, &old_ts);
+		q->xef->run_ticks[q->class] += (new_ts - old_ts) * q->width;
+	}
+	xe_bo_unlock(lrc->bo);
 
 	drm_dev_exit(idx);
 }
diff --git a/drivers/gpu/drm/xe/xe_pt.c b/drivers/gpu/drm/xe/xe_pt.c
index 8e5f4f0dea3f..1764bae6e481 100644
--- a/drivers/gpu/drm/xe/xe_pt.c
+++ b/drivers/gpu/drm/xe/xe_pt.c
@@ -211,7 +211,8 @@ void xe_pt_destroy(struct xe_pt *pt, u32 flags, struct llist_head *deferred)
 		return;
 
 	XE_WARN_ON(!list_empty(&pt->bo->ttm.base.gpuva.list));
-	xe_bo_unpin(pt->bo);
+	if (!xe_bo_is_purged(pt->bo))
+		xe_bo_unpin(pt->bo);
 	xe_bo_put_deferred(pt->bo, deferred);
 
 	if (pt->level > 0 && pt->num_live) {
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 935e589dd4b0..fcf32360f240 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -13,7 +13,10 @@
 
 #include "xe_bo.h"
 #include "xe_device.h"
+#include "xe_exec_queue.h"
+#include "xe_lrc.h"
 #include "xe_res_cursor.h"
+#include "xe_ttm_stolen_mgr.h"
 #include "xe_ttm_vram_mgr.h"
 #include "xe_vram_types.h"
 
@@ -280,6 +283,25 @@ static const struct ttm_resource_manager_func xe_ttm_vram_mgr_func = {
 	.debug	= xe_ttm_vram_mgr_debug
 };
 
+static void xe_ttm_vram_free_bad_pages(struct drm_device *dev, struct xe_ttm_vram_mgr *mgr)
+{
+	struct xe_ttm_vram_offline_resource *pos, *n;
+
+	list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
+		gpu_buddy_free_list(&mgr->mm, &pos->blocks, 0);
+		mgr->visible_avail += pos->used_visible_size;
+		list_del(&pos->offlined_link);
+		--mgr->n_offlined_pages;
+		kfree(pos);
+	}
+	list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
+		gpu_buddy_free_list(&mgr->mm, &pos->blocks, 0);
+		list_del(&pos->queued_link);
+		--mgr->n_queued_pages;
+		kfree(pos);
+	}
+}
+
 static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
 {
 	struct xe_device *xe = to_xe_device(dev);
@@ -291,6 +313,10 @@ static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
 	if (ttm_resource_manager_evict_all(&xe->ttm, man))
 		return;
 
+	mutex_lock(&mgr->lock);
+	xe_ttm_vram_free_bad_pages(dev, mgr);
+	mutex_unlock(&mgr->lock);
+
 	WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
 
 	mutex_lock(&mgr->lock);
@@ -321,6 +347,8 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
 	man->func = &xe_ttm_vram_mgr_func;
 	mgr->mem_type = mem_type;
 	mutex_init(&mgr->lock);
+	INIT_LIST_HEAD(&mgr->offlined_pages);
+	INIT_LIST_HEAD(&mgr->queued_pages);
 	mgr->default_page_size = default_page_size;
 	mgr->visible_size = io_size;
 	mgr->visible_avail = io_size;
@@ -477,3 +505,248 @@ u64 xe_ttm_vram_get_avail(struct ttm_resource_manager *man)
 
 	return avail;
 }
+
+static int xe_ttm_vram_purge_page(struct xe_device *xe, struct xe_bo *bo)
+{
+	struct ttm_operation_ctx ctx = {};
+	struct xe_vm *vm = NULL;
+	u32	flags;
+	int ret = 0;
+
+	xe_bo_lock(bo, false);
+	if (bo->vm)
+		vm = xe_vm_get(bo->vm);
+	flags = bo->flags;
+	xe_bo_unlock(bo);
+	/*  Ban VM if BO is PPGTT */
+	if (vm && (flags & XE_BO_FLAG_PAGETABLE)) {
+		down_write(&vm->lock);
+		xe_vm_kill(vm, true);
+		up_write(&vm->lock);
+	}
+	if (vm)
+		xe_vm_put(vm);
+
+	xe_bo_lock(bo, false);
+	/*  Ban exec queue if BO is lrc */
+	if (bo->q && xe_exec_queue_get_unless_zero(bo->q)) {
+		/* ban queue */
+		xe_exec_queue_kill(bo->q);
+		xe_exec_queue_put(bo->q);
+	}
+
+	xe_bo_set_purgeable_state(bo, XE_MADV_PURGEABLE_DONTNEED);
+	ttm_bo_unmap_virtual(&bo->ttm);   /* nuke CPU mmap + VRAM IO mappings */
+	if (xe_bo_is_pinned(bo))
+		xe_bo_unpin(bo);
+	ret = xe_ttm_bo_purge(&bo->ttm, &ctx);
+	xe_bo_unlock(bo);
+
+	return ret;
+}
+
+static int xe_ttm_vram_reserve_page_at_addr(struct xe_device *xe, unsigned long addr,
+					    struct xe_ttm_vram_mgr *vram_mgr, struct gpu_buddy *mm)
+{
+	struct xe_ttm_vram_offline_resource *nentry;
+	struct ttm_buffer_object *tbo = NULL;
+	struct gpu_buddy_block *block;
+	struct gpu_buddy_block *b, *m;
+	enum reserve_status {
+		pending = 0,
+		fail
+	};
+	u64 size = SZ_4K;
+	int ret = 0;
+
+	scoped_guard(mutex, &vram_mgr->lock) {
+		block = gpu_buddy_allocated_addr_to_block(mm, addr);
+		if (PTR_ERR(block) == -ENXIO)
+			return PTR_ERR(block);
+
+		nentry = kzalloc_obj(*nentry);
+		if (!nentry)
+			return -ENOMEM;
+		INIT_LIST_HEAD(&nentry->blocks);
+		nentry->status = pending;
+		nentry->addr = addr;
+
+		if (block) {
+			struct xe_bo *pbo;
+
+			WARN_ON(!block->private);
+			tbo = block->private;
+			pbo = ttm_to_xe_bo(tbo);
+
+			/* Get reference safely - BO may have zero refcount */
+			if (!xe_bo_get_unless_zero(pbo)) {
+				kfree(nentry);
+				return -ENOENT;
+			}
+			/* Critical kernel BO? */
+			if ((pbo->ttm.type == ttm_bo_type_kernel &&
+			     !(pbo->flags & XE_BO_FLAG_PINNED_LATE_RESTORE)) ||
+			    (xe_bo_is_user(pbo) && xe_bo_is_pinned(pbo))) {
+				kfree(nentry);
+				xe_ttm_vram_free_bad_pages(&xe->drm, vram_mgr);
+				xe_bo_put(pbo);
+				drm_err(&xe->drm,
+					"%s: addr: 0x%lx is critical kernel bo, requesting SBR\n",
+					__func__, addr);
+				/* Hint System controller driver for reset with -EIO  */
+				return -EIO;
+			}
+			nentry->id = ++vram_mgr->n_queued_pages;
+			list_add(&nentry->queued_link, &vram_mgr->queued_pages);
+		}
+	}
+	if (block) {
+		struct xe_ttm_vram_offline_resource *pos, *n;
+		struct xe_bo *pbo = ttm_to_xe_bo(tbo);
+
+		/* Purge BO containing address - reference held from above */
+		ret = xe_ttm_vram_purge_page(xe, pbo);
+		xe_bo_put(pbo);
+		if (ret) {
+			nentry->status = fail;
+			return ret;
+		}
+
+		/* Reserve page at address addr */
+		scoped_guard(mutex, &vram_mgr->lock) {
+			ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
+						     size, size, &nentry->blocks,
+						     GPU_BUDDY_RANGE_ALLOCATION);
+
+			if (ret) {
+				drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
+					 addr, ret);
+				nentry->status = fail;
+				return ret;
+			}
+
+			list_for_each_entry_safe(b, m, &nentry->blocks, link)
+				b->private = NULL;
+
+			if ((addr + size) <= vram_mgr->visible_size) {
+				nentry->used_visible_size = size;
+			} else {
+				list_for_each_entry(b, &nentry->blocks, link) {
+					u64 start = gpu_buddy_block_offset(b);
+
+					if (start < vram_mgr->visible_size) {
+						u64 end = start + gpu_buddy_block_size(mm, b);
+
+						nentry->used_visible_size +=
+							min(end, vram_mgr->visible_size) - start;
+					}
+				}
+			}
+			vram_mgr->visible_avail -= nentry->used_visible_size;
+			list_for_each_entry_safe(pos, n, &vram_mgr->queued_pages, queued_link) {
+				if (pos->id == nentry->id) {
+					--vram_mgr->n_queued_pages;
+					list_del(&pos->queued_link);
+					break;
+				}
+			}
+			list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
+			/* RAS will send command to FW for offlining page based on ret value */
+			++vram_mgr->n_offlined_pages;
+			return ret;
+		}
+	} else {
+		scoped_guard(mutex, &vram_mgr->lock) {
+			ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
+						     size, size, &nentry->blocks,
+						     GPU_BUDDY_RANGE_ALLOCATION);
+			if (ret) {
+				drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
+					 addr, ret);
+				nentry->status = fail;
+				return ret;
+			}
+
+			list_for_each_entry_safe(b, m, &nentry->blocks, link)
+				b->private = NULL;
+
+			if ((addr + size) <= vram_mgr->visible_size) {
+				nentry->used_visible_size = size;
+			} else {
+				struct gpu_buddy_block *block;
+
+				list_for_each_entry(block, &nentry->blocks, link) {
+					u64 start = gpu_buddy_block_offset(block);
+
+					if (start < vram_mgr->visible_size) {
+						u64 end = start + gpu_buddy_block_size(mm, block);
+
+						nentry->used_visible_size +=
+							min(end, vram_mgr->visible_size) - start;
+					}
+				}
+			}
+			vram_mgr->visible_avail -= nentry->used_visible_size;
+			nentry->id = ++vram_mgr->n_offlined_pages;
+			list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
+			/* RAS will send command to FW for offlining page based on ret value */
+		}
+	}
+	/* Success */
+	return ret;
+}
+
+static struct xe_vram_region *xe_ttm_vram_addr_to_region(struct xe_device *xe,
+							 resource_size_t addr)
+{
+	unsigned long stolen_base = xe_ttm_stolen_gpu_offset(xe);
+	struct xe_vram_region *vr;
+	struct xe_tile *tile;
+	int id;
+
+	/* Addr from stolen memory? */
+	if (addr + SZ_4K >= stolen_base)
+		return NULL;
+
+	for_each_tile(tile, xe, id) {
+		vr = tile->mem.vram;
+		if ((addr <= vr->dpa_base + vr->actual_physical_size) &&
+		    (addr + SZ_4K >= vr->dpa_base))
+			return vr;
+	}
+	return NULL;
+}
+
+/**
+ * xe_ttm_vram_handle_addr_fault - Handle vram physical address error flagged
+ * @xe: pointer to parent device
+ * @addr: physical faulty address
+ *
+ * Handle the physical faulty address error on specific tile.
+ *
+ * Returns 0 for success, negative error code otherwise.
+ */
+int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
+{
+	struct xe_ttm_vram_mgr *vram_mgr;
+	struct xe_vram_region *vr;
+	struct gpu_buddy *mm;
+	int ret;
+
+	vr = xe_ttm_vram_addr_to_region(xe, addr);
+	if (!vr) {
+		drm_err(&xe->drm, "%s:%d addr:%lx error requesting SBR\n",
+			__func__, __LINE__, addr);
+		/* Hint System controller driver for reset with -EIO  */
+		return -EIO;
+	}
+	vram_mgr = &vr->ttm;
+	mm = &vram_mgr->mm;
+
+	/* TODO: Check if we already processed faulted address, and if yes return -EEXIST */
+
+	/* Reserve page at address */
+	ret = xe_ttm_vram_reserve_page_at_addr(xe, addr, vram_mgr, mm);
+	return ret;
+}
+EXPORT_SYMBOL(xe_ttm_vram_handle_addr_fault);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
index 87b7fae5edba..8ef06d9d44f7 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
@@ -31,6 +31,7 @@ u64 xe_ttm_vram_get_cpu_visible_size(struct ttm_resource_manager *man);
 void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
 			  u64 *used, u64 *used_visible);
 
+int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr);
 static inline struct xe_ttm_vram_mgr_resource *
 to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
 {
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index 9106da056b49..3ad7966798eb 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ -19,6 +19,14 @@ struct xe_ttm_vram_mgr {
 	struct ttm_resource_manager manager;
 	/** @mm: DRM buddy allocator which manages the VRAM */
 	struct gpu_buddy mm;
+	/** @offlined_pages: List of offlined pages */
+	struct list_head offlined_pages;
+	/** @n_offlined_pages: Number of offlined pages */
+	u16 n_offlined_pages;
+	/** @queued_pages: List of queued pages */
+	struct list_head queued_pages;
+	/** @n_queued_pages: Number of queued pages */
+	u16 n_queued_pages;
 	/** @visible_size: Proped size of the CPU visible portion */
 	u64 visible_size;
 	/** @visible_avail: CPU visible portion still unallocated */
@@ -45,4 +53,24 @@ struct xe_ttm_vram_mgr_resource {
 	unsigned long flags;
 };
 
+/**
+ * struct xe_ttm_vram_offline_resource - Xe TTM VRAM offline resource
+ */
+struct xe_ttm_vram_offline_resource {
+	/** @offlined_link: Link to offlined pages */
+	struct list_head offlined_link;
+	/** @queued_link: Link to queued pages */
+	struct list_head queued_link;
+	/** @blocks: list of DRM buddy blocks */
+	struct list_head blocks;
+	/** @used_visible_size: How many CPU visible bytes this resource is using */
+	u64 used_visible_size;
+	/** @id: The id of an offline resource */
+	u16 id;
+	/** @addr: Address of faulty memory location reported by HW */
+	unsigned long addr;
+	/** @status: reservation status of resource */
+	bool status;
+};
+
 #endif
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V7 07/10] drm/xe/cri: Add debugfs to inject faulty vram address
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
                   ` (5 preceding siblings ...)
  2026-04-16  7:49 ` [RFC PATCH V7 06/10] drm/xe: Handle physical memory address error Tejas Upadhyay
@ 2026-04-16  7:49 ` Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 08/10] gpu/buddy: Add routine to dump allocated buddy blocks Tejas Upadhyay
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tejas Upadhyay @ 2026-04-16  7:49 UTC (permalink / raw)
  To: intel-xe
  Cc: matthew.auld, matthew.brost, thomas.hellstrom,
	himal.prasad.ghimiray, Tejas Upadhyay

Add a debugfs interface that helps test the feature with manual error
injection. The debugfs interface in the drm/xe driver allows manual
injection of faulty VRAM addresses, facilitating testing of the CRI
memory page offline feature before it is fully functional. The
implementation creates a debugfs entry under
/sys/kernel/debug/dri/bdf/invalid_addr_vram0
to accept the specific address type to be tested.

For example,
echo 0 > /sys/kernel/debug/dri/bdf/invalid_addr_vram0
where 0 is one of the address types below,
enum mempage_offline_mode {
        MEMPAGE_OFFLINE_UNALLOCATED = 0,
        MEMPAGE_OFFLINE_USER_ALLOCATED = 1,
        MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED = 2,
        MEMPAGE_OFFLINE_KERNEL_USER_PPGTT_ALLOCATED = 3,
        MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED = 4,
        MEMPAGE_OFFLINE_RESERVED = 5
};
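
Conceptually, the handler scans VRAM in 4K steps and offlines the first
page whose allocation state matches the requested mode. A minimal
userspace sketch of that scan (not the driver code; the allocation table
and classify() helper are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocation record standing in for a buddy block lookup. */
enum page_state { PAGE_FREE, PAGE_USER, PAGE_KERNEL };

struct alloc { unsigned long start, end; enum page_state state; };

/* Return the state of the 4K page at addr, PAGE_FREE if unallocated. */
static enum page_state classify(const struct alloc *table, size_t n,
				unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr >= table[i].start && addr < table[i].end)
			return table[i].state;
	return PAGE_FREE;
}

/* Scan [base, base + size) in 4K steps; return the first address whose
 * state matches want, or -1UL if none is found. This mirrors the shape
 * of the driver's scan loop, minus locking and BO classification. */
static unsigned long find_page(const struct alloc *table, size_t n,
			       unsigned long base, unsigned long size,
			       enum page_state want)
{
	for (unsigned long a = base; a < base + size; a += 0x1000)
		if (classify(table, n, a) == want)
			return a;
	return -1UL;
}
```

The real driver additionally takes the VRAM manager lock around each
lookup and verifies the reservation afterwards.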

v4:
- Use scope_guard around lock, adapt bo->q and enhance warn messages
- %s/gpu_buddy_addr_to_block/gpu_buddy_allocated_addr_to_block
v3:
- Add more specific noncritical bo tests
v2:
- Add mode based automated test vs manual address feed

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_debugfs.c            | 171 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |   2 +
 2 files changed, 173 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
index c9d4484821af..ce899aa363b1 100644
--- a/drivers/gpu/drm/xe/xe_debugfs.c
+++ b/drivers/gpu/drm/xe/xe_debugfs.c
@@ -14,6 +14,7 @@
 #include "regs/xe_pmt.h"
 #include "xe_bo.h"
 #include "xe_device.h"
+#include "xe_exec_queue_types.h"
 #include "xe_force_wake.h"
 #include "xe_gt.h"
 #include "xe_gt_debugfs.h"
@@ -21,6 +22,7 @@
 #include "xe_guc_ads.h"
 #include "xe_hw_engine.h"
 #include "xe_mmio.h"
+#include "xe_migrate.h"
 #include "xe_pm.h"
 #include "xe_psmi.h"
 #include "xe_pxp_debugfs.h"
@@ -29,6 +31,8 @@
 #include "xe_sriov_vf.h"
 #include "xe_step.h"
 #include "xe_tile_debugfs.h"
+#include "xe_ttm_stolen_mgr.h"
+#include "xe_ttm_vram_mgr.h"
 #include "xe_vsec.h"
 #include "xe_wa.h"
 
@@ -40,6 +44,14 @@
 
 DECLARE_FAULT_ATTR(gt_reset_failure);
 DECLARE_FAULT_ATTR(inject_csc_hw_error);
+enum mempage_offline_mode {
+	MEMPAGE_OFFLINE_UNALLOCATED = 0,
+	MEMPAGE_OFFLINE_USER_ALLOCATED = 1,
+	MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED = 2,
+	MEMPAGE_OFFLINE_KERNEL_USER_PPGTT_ALLOCATED = 3,
+	MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED = 4,
+	MEMPAGE_OFFLINE_RESERVED = 5,
+};
 
 static void read_residency_counter(struct xe_device *xe, struct xe_mmio *mmio,
 				   u32 offset, const char *name, struct drm_printer *p)
@@ -544,6 +556,154 @@ static const struct file_operations disable_late_binding_fops = {
 	.write = disable_late_binding_set,
 };
 
+static ssize_t addr_fault_reporting_show(struct file *f, char __user *ubuf,
+					 size_t size, loff_t *pos)
+{
+	struct xe_device *xe = file_inode(f)->i_private;
+	char buf[32];
+	int len;
+
+	len = scnprintf(buf, sizeof(buf), "%llu\n", xe->mem.vram->ttm.offline_mode);
+
+	return simple_read_from_buffer(ubuf, size, pos, buf, len);
+}
+
+static int mempage_exec_offline(struct xe_device *xe, u64 mode)
+{
+	struct xe_tile *tile = xe_device_get_root_tile(xe);
+	struct xe_vram_region *vr = tile->mem.vram;
+	struct ttm_buffer_object *tbo = NULL;
+	struct xe_ttm_vram_mgr *vram_mgr;
+	struct gpu_buddy_block *block;
+	bool do_offline = false;
+	struct gpu_buddy *mm;
+	struct xe_bo *bo;
+	u64 addr = 0x0;
+	int ret = 0;
+
+	vram_mgr = &vr->ttm;
+	mm = &vram_mgr->mm;
+	addr = vr->dpa_base;
+	while (addr <= vr->dpa_base + vr->actual_physical_size) {
+		scoped_guard(mutex, &vram_mgr->lock) {
+			block = gpu_buddy_allocated_addr_to_block(mm, addr);
+			if (!block && mode == MEMPAGE_OFFLINE_UNALLOCATED)
+				do_offline = true;
+			if (block && PTR_ERR(block) != -ENXIO) {
+				if (!block->private) {
+					addr = addr + SZ_4K;
+					do_offline = false;
+					continue;
+				}
+				tbo = block->private;
+				bo = ttm_to_xe_bo(tbo);
+				if (bo->ttm.type == ttm_bo_type_device &&
+				    bo->flags & XE_BO_FLAG_USER &&
+				    bo->flags & XE_BO_FLAG_VRAM_MASK &&
+				    mode == MEMPAGE_OFFLINE_USER_ALLOCATED) {
+					do_offline = true;
+				} else if (bo->q &&
+					   mode == MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED) {
+					/* lrc */
+					struct xe_vm *migrate_vm;
+
+					migrate_vm = xe_migrate_get_vm(tile->migrate);
+					if (migrate_vm != bo->q->vm)
+						do_offline = true;
+					xe_vm_put(migrate_vm);
+				} else if (bo->ttm.type == ttm_bo_type_kernel &&
+					   bo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
+					   bo->flags & XE_BO_FLAG_PAGETABLE &&
+					   mode == MEMPAGE_OFFLINE_KERNEL_USER_PPGTT_ALLOCATED) {
+					/* ppgtt */
+					do_offline = true;
+				} else if (bo->ttm.type == ttm_bo_type_kernel &&
+					   !(bo->flags & XE_BO_FLAG_FORCE_USER_VRAM) &&
+					   mode == MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED) {
+					do_offline = true;
+				}
+			}
+		}
+		if (do_offline) {
+			/* Report fault */
+			ret = xe_ttm_vram_handle_addr_fault(xe, addr);
+			if (ret) {
+				if ((ret == -EIO) &&
+				    mode == MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED) {
+					addr = addr + SZ_4K;
+					if (do_offline)
+						do_offline = false;
+					continue;
+				}
+				break;
+			}
+			/* Verify the page at addr is now reserved */
+			scoped_guard(mutex, &vram_mgr->lock) {
+				block = gpu_buddy_allocated_addr_to_block(mm, addr);
+				if (!block || PTR_ERR(block) == -ENXIO || block->private)
+					ret = -EBUSY;
+			}
+			break;
+		}
+		addr = addr + SZ_4K;
+		if (do_offline)
+			do_offline = false;
+	}
+	if (!do_offline)
+		drm_warn(&xe->drm, "no such object, ret:%d\n", ret);
+
+	return ret;
+}
+
+static ssize_t addr_fault_reporting_set(struct file *f, const char __user *ubuf,
+					size_t size, loff_t *pos)
+{
+	struct xe_device *xe = file_inode(f)->i_private;
+	int ret = 0;
+	u64 mode;
+
+	ret = kstrtou64_from_user(ubuf, size, 0, &mode);
+	if (ret)
+		return ret;
+
+	switch (mode) {
+	case MEMPAGE_OFFLINE_UNALLOCATED:
+	case MEMPAGE_OFFLINE_USER_ALLOCATED:
+	case MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED:
+	case MEMPAGE_OFFLINE_KERNEL_USER_PPGTT_ALLOCATED:
+	case MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED:
+		ret = mempage_exec_offline(xe, mode);
+		break;
+	case MEMPAGE_OFFLINE_RESERVED: {
+		u64 stolen_base = xe_ttm_stolen_gpu_offset(xe);
+
+		ret = xe_ttm_vram_handle_addr_fault(xe, stolen_base);
+		break;
+	}
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	xe->mem.vram->ttm.offline_mode = mode;
+	if (!ret || (ret == -EIO &&
+		     (mode == MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED ||
+		      mode == MEMPAGE_OFFLINE_RESERVED))) {
+		drm_info(&xe->drm, "offline mode %llu passed ret:%d\n", mode, ret);
+	} else {
+		drm_warn(&xe->drm, "offline mode %llu failed, ret:%d\n", mode, ret);
+		return ret;
+	}
+
+	return size;
+}
+
+static const struct file_operations addr_fault_reporting_fops = {
+	.owner = THIS_MODULE,
+	.read = addr_fault_reporting_show,
+	.write = addr_fault_reporting_set,
+};
+
 void xe_debugfs_register(struct xe_device *xe)
 {
 	struct ttm_device *bdev = &xe->ttm;
@@ -600,6 +760,17 @@ void xe_debugfs_register(struct xe_device *xe)
 	if (man)
 		ttm_resource_manager_create_debugfs(man, root, "stolen_mm");
 
+	if (xe->info.platform == XE_CRESCENTISLAND) {
+		man = ttm_manager_type(bdev, XE_PL_VRAM0);
+		if (man) {
+			char name[20];
+
+			snprintf(name, sizeof(name), "invalid_addr_vram%d", 0);
+			debugfs_create_file(name, 0600, root, xe,
+					    &addr_fault_reporting_fops);
+		}
+	}
+
 	for_each_tile(tile, xe, tile_id)
 		xe_tile_debugfs_register(tile);
 
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index 3ad7966798eb..07ed88b47e04 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ -37,6 +37,8 @@ struct xe_ttm_vram_mgr {
 	struct mutex lock;
 	/** @mem_type: The TTM memory type */
 	u32 mem_type;
+	/** @offline_mode: debugfs hook for setting page offline mode */
+	u64 offline_mode;
 };
 
 /**
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V7 08/10] gpu/buddy: Add routine to dump allocated buddy blocks
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
                   ` (6 preceding siblings ...)
  2026-04-16  7:49 ` [RFC PATCH V7 07/10] drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
@ 2026-04-16  7:49 ` Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 09/10] drm/xe/configfs: Add vram bad page reservation policy Tejas Upadhyay
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tejas Upadhyay @ 2026-04-16  7:49 UTC (permalink / raw)
  To: intel-xe
  Cc: matthew.auld, matthew.brost, thomas.hellstrom,
	himal.prasad.ghimiray, Tejas Upadhyay

Introduce a new API to show the allocated blocks under a specific VRAM
instance in the drm driver. While the existing debug output typically
shows only the free block lists, this addition provides a comprehensive
view of all currently resident VRAM allocations.

The dump will look like:

[  +0.000003] xe 0000:03:00.0: [drm] 0x00000002f8000000-0x00000002f8800000: 8388608
[  +0.000005] xe 0000:03:00.0: [drm] 0x00000002f8800000-0x00000002f8840000: 262144
[  +0.000004] xe 0000:03:00.0: [drm] 0x00000002f8840000-0x00000002f8860000: 131072
[  +0.000004] xe 0000:03:00.0: [drm] 0x00000002f8860000-0x00000002f8870000: 65536
[  +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9000000-0x00000002f9800000: 8388608
[  +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9800000-0x00000002f9880000: 524288
[  +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9880000-0x00000002f9884000: 16384
[  +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9900000-0x00000002f9980000: 524288
[  +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9980000-0x00000002f9988000: 32768
[  +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9988000-0x00000002f998c000: 16384

v2(MattB):
- Add lockdep assert

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/buddy.c       | 43 +++++++++++++++++++++++++++++++++++++++
 include/linux/gpu_buddy.h |  1 +
 2 files changed, 44 insertions(+)

diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c
index 44bf926aca56..09aa4a33f928 100644
--- a/drivers/gpu/buddy.c
+++ b/drivers/gpu/buddy.c
@@ -10,6 +10,7 @@
 #include <linux/sizes.h>
 
 #include <linux/gpu_buddy.h>
+#include <drm/drm_print.h>
 
 /**
  * gpu_buddy_assert - assert a condition in the buddy allocator
@@ -1295,6 +1296,48 @@ int gpu_buddy_block_trim(struct gpu_buddy *mm,
 }
 EXPORT_SYMBOL(gpu_buddy_block_trim);
 
+/**
+ * gpu_buddy_dump_allocated_blocks - print all allocated blocks in a GPU buddy
+ *
+ * @mm: GPU buddy manager to look into
+ *
+ * Walks the buddy manager, checking each block's status, and prints the
+ * range and size of every allocated block.
+ *
+ * Returns:
+ * void
+ */
+void gpu_buddy_dump_allocated_blocks(struct gpu_buddy *mm)
+{
+	struct gpu_buddy_block *block;
+	LIST_HEAD(dfs);
+	int i;
+
+	gpu_buddy_driver_lock_held(mm);
+
+	for (i = 0; i < mm->n_roots; ++i)
+		list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+	do {
+		block = list_first_entry_or_null(&dfs,
+						 struct gpu_buddy_block,
+						 tmp_link);
+		if (!block)
+			break;
+
+		list_del(&block->tmp_link);
+
+		if (gpu_buddy_block_is_allocated(block))
+			gpu_buddy_block_print(mm, block);
+
+		if (gpu_buddy_block_is_split(block)) {
+			list_add(&block->right->tmp_link, &dfs);
+			list_add(&block->left->tmp_link, &dfs);
+		}
+	} while (1);
+}
+EXPORT_SYMBOL(gpu_buddy_dump_allocated_blocks);
+
 static struct gpu_buddy_block *
 __gpu_buddy_alloc_blocks(struct gpu_buddy *mm,
 			 u64 start, u64 end,
diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
index e42c38a2a9f5..2834ea06da56 100644
--- a/include/linux/gpu_buddy.h
+++ b/include/linux/gpu_buddy.h
@@ -267,6 +267,7 @@ int gpu_buddy_block_trim(struct gpu_buddy *mm,
 			 u64 *start,
 			 u64 new_size,
 			 struct list_head *blocks);
+void gpu_buddy_dump_allocated_blocks(struct gpu_buddy *mm);
 
 void gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear);
 
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V7 09/10] drm/xe/configfs: Add vram bad page reservation policy
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
                   ` (7 preceding siblings ...)
  2026-04-16  7:49 ` [RFC PATCH V7 08/10] gpu/buddy: Add routine to dump allocated buddy blocks Tejas Upadhyay
@ 2026-04-16  7:49 ` Tejas Upadhyay
  2026-04-16  7:49 ` [RFC PATCH V7 10/10] drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Tejas Upadhyay @ 2026-04-16  7:49 UTC (permalink / raw)
  To: intel-xe
  Cc: matthew.auld, matthew.brost, thomas.hellstrom,
	himal.prasad.ghimiray, Tejas Upadhyay

The interface enables setting the policy for how bad pages are
handled in VRAM. This is crucial for maintaining system
stability in scenarios where VRAM degradation occurs.

By default the policy is "reserve", which can be changed to
"logging" only.
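
A minimal sketch of the gating this adds to the fault handler (invented
helper names; the real code reads the policy via
xe_configfs_get_bad_page_reservation()). A logging-only policy
short-circuits before any reservation is attempted:

```c
#include <assert.h>

#define EOPNOTSUPP 95	/* matches the Linux errno value */

/* If reservation is disabled, return -EOPNOTSUPP so the caller (RAS)
 * can drop the address after logging; otherwise attempt reservation. */
static int handle_addr_fault(int reserve_policy, unsigned long addr,
			     int (*reserve)(unsigned long))
{
	if (!reserve_policy)
		return -EOPNOTSUPP;	/* logging-only: no carve-out */
	return reserve(addr);
}

/* Stand-in for the page reservation path, always succeeding. */
static int fake_reserve(unsigned long addr) { (void)addr; return 0; }
```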

v3:
- All FW communication moved under RAS
v2:
- Add CRI check and rebase

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_configfs.c     | 64 +++++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_configfs.h     |  2 +
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 10 +++++
 3 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_configfs.c b/drivers/gpu/drm/xe/xe_configfs.c
index 32102600a148..e07a6a74896b 100644
--- a/drivers/gpu/drm/xe/xe_configfs.c
+++ b/drivers/gpu/drm/xe/xe_configfs.c
@@ -61,7 +61,8 @@
  *	    ├── survivability_mode
  *	    ├── gt_types_allowed
  *	    ├── engines_allowed
- *	    └── enable_psmi
+ *	    ├── enable_psmi
+ *	    └── bad_page_reservation
  *
  * After configuring the attributes as per next section, the device can be
  * probed with::
@@ -159,6 +160,16 @@
  *
  * This attribute can only be set before binding to the device.
  *
+ * Bad pages reservation:
+ * ----------------------
+ *
+ * Controls reservation of bad vram pages. When disabled, bad pages are
+ * only reported in dmesg. Example to disable it::
+ *
+ *      # echo 0 > /sys/kernel/config/xe/0000:03:00.0/bad_page_reservation
+ *
+ * This attribute can only be set before binding to the device.
+ *
  * Context restore BB
  * ------------------
  *
@@ -262,6 +273,7 @@ struct xe_config_group_device {
 		struct wa_bb ctx_restore_mid_bb[XE_ENGINE_CLASS_MAX];
 		bool survivability_mode;
 		bool enable_psmi;
+		bool bad_page_reservation;
 		struct {
 			unsigned int max_vfs;
 			bool admin_only_pf;
@@ -281,6 +293,7 @@ static const struct xe_config_device device_defaults = {
 	.engines_allowed = U64_MAX,
 	.survivability_mode = false,
 	.enable_psmi = false,
+	.bad_page_reservation = true,
 	.sriov = {
 		.max_vfs = XE_DEFAULT_MAX_VFS,
 		.admin_only_pf = XE_DEFAULT_ADMIN_ONLY_PF,
@@ -575,6 +588,32 @@ static ssize_t enable_psmi_store(struct config_item *item, const char *page, siz
 	return len;
 }
 
+static ssize_t bad_page_reservation_show(struct config_item *item, char *page)
+{
+	struct xe_config_device *dev = to_xe_config_device(item);
+
+	return sprintf(page, "%d\n", dev->bad_page_reservation);
+}
+
+static ssize_t bad_page_reservation_store(struct config_item *item, const char *page, size_t len)
+{
+	struct xe_config_group_device *dev = to_xe_config_group_device(item);
+	bool val;
+	int ret;
+
+	ret = kstrtobool(page, &val);
+	if (ret)
+		return ret;
+
+	guard(mutex)(&dev->lock);
+	if (is_bound(dev))
+		return -EBUSY;
+
+	dev->config.bad_page_reservation = val;
+
+	return len;
+}
+
 static bool wa_bb_read_advance(bool dereference, char **p,
 			       const char *append, size_t len,
 			       size_t *max_size)
@@ -813,6 +852,7 @@ static ssize_t ctx_restore_post_bb_store(struct config_item *item,
 CONFIGFS_ATTR(, ctx_restore_mid_bb);
 CONFIGFS_ATTR(, ctx_restore_post_bb);
 CONFIGFS_ATTR(, enable_psmi);
+CONFIGFS_ATTR(, bad_page_reservation);
 CONFIGFS_ATTR(, engines_allowed);
 CONFIGFS_ATTR(, gt_types_allowed);
 CONFIGFS_ATTR(, survivability_mode);
@@ -821,6 +861,7 @@ static struct configfs_attribute *xe_config_device_attrs[] = {
 	&attr_ctx_restore_mid_bb,
 	&attr_ctx_restore_post_bb,
 	&attr_enable_psmi,
+	&attr_bad_page_reservation,
 	&attr_engines_allowed,
 	&attr_gt_types_allowed,
 	&attr_survivability_mode,
@@ -1098,6 +1139,7 @@ static void dump_custom_dev_config(struct pci_dev *pdev,
 	PRI_CUSTOM_ATTR("%llx", gt_types_allowed);
 	PRI_CUSTOM_ATTR("%llx", engines_allowed);
 	PRI_CUSTOM_ATTR("%d", enable_psmi);
+	PRI_CUSTOM_ATTR("%d", bad_page_reservation);
 	PRI_CUSTOM_ATTR("%d", survivability_mode);
 	PRI_CUSTOM_ATTR("%u", sriov.admin_only_pf);
 
@@ -1225,6 +1267,26 @@ bool xe_configfs_get_psmi_enabled(struct pci_dev *pdev)
 	return ret;
 }
 
+/**
+ * xe_configfs_get_bad_page_reservation - get configfs bad_page_reservation setting
+ * @pdev: pci device
+ *
+ * Return: bad_page_reservation setting in configfs
+ */
+bool xe_configfs_get_bad_page_reservation(struct pci_dev *pdev)
+{
+	struct xe_config_group_device *dev = find_xe_config_group_device(pdev);
+	bool ret;
+
+	if (!dev)
+		return device_defaults.bad_page_reservation;
+
+	ret = dev->config.bad_page_reservation;
+	config_group_put(&dev->group);
+
+	return ret;
+}
+
 /**
  * xe_configfs_get_ctx_restore_mid_bb - get configfs ctx_restore_mid_bb setting
  * @pdev: pci device
diff --git a/drivers/gpu/drm/xe/xe_configfs.h b/drivers/gpu/drm/xe/xe_configfs.h
index 07d62bf0c152..c107d84b2c62 100644
--- a/drivers/gpu/drm/xe/xe_configfs.h
+++ b/drivers/gpu/drm/xe/xe_configfs.h
@@ -23,6 +23,7 @@ bool xe_configfs_primary_gt_allowed(struct pci_dev *pdev);
 bool xe_configfs_media_gt_allowed(struct pci_dev *pdev);
 u64 xe_configfs_get_engines_allowed(struct pci_dev *pdev);
 bool xe_configfs_get_psmi_enabled(struct pci_dev *pdev);
+bool xe_configfs_get_bad_page_reservation(struct pci_dev *pdev);
 u32 xe_configfs_get_ctx_restore_mid_bb(struct pci_dev *pdev,
 				       enum xe_engine_class class,
 				       const u32 **cs);
@@ -42,6 +43,7 @@ static inline bool xe_configfs_primary_gt_allowed(struct pci_dev *pdev) { return
 static inline bool xe_configfs_media_gt_allowed(struct pci_dev *pdev) { return true; }
 static inline u64 xe_configfs_get_engines_allowed(struct pci_dev *pdev) { return U64_MAX; }
 static inline bool xe_configfs_get_psmi_enabled(struct pci_dev *pdev) { return false; }
+static inline bool xe_configfs_get_bad_page_reservation(struct pci_dev *pdev) { return true; }
 static inline u32 xe_configfs_get_ctx_restore_mid_bb(struct pci_dev *pdev,
 						     enum xe_engine_class class,
 						     const u32 **cs) { return 0; }
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index fcf32360f240..7f58e7e8c3e1 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -12,6 +12,7 @@
 #include <drm/ttm/ttm_range_manager.h>
 
 #include "xe_bo.h"
+#include "xe_configfs.h"
 #include "xe_device.h"
 #include "xe_exec_queue.h"
 #include "xe_lrc.h"
@@ -731,6 +732,7 @@ int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
 	struct xe_ttm_vram_mgr *vram_mgr;
 	struct xe_vram_region *vr;
 	struct gpu_buddy *mm;
+	bool policy;
 	int ret;
 
 	vr = xe_ttm_vram_addr_to_region(xe, addr);
@@ -745,6 +747,14 @@ int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
 
 	/* TODO: Check if we already processed faulted address, and if yes return -EEXIST */
 
+	policy = xe_configfs_get_bad_page_reservation(to_pci_dev(xe->drm.dev));
+	if (!policy) {
+		drm_err(&xe->drm, "0x%lx is reported as corrupted address by HW\n",
+			addr);
+		/* Let RAS report to FW to drop addr from SRAM queue */
+		return -EOPNOTSUPP;
+	}
+
 	/* Reserve page at address */
 	ret = xe_ttm_vram_reserve_page_at_addr(xe, addr, vram_mgr, mm);
 	return ret;
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH V7 10/10] drm/xe/cri: Add sysfs interface for bad gpu vram pages
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
                   ` (8 preceding siblings ...)
  2026-04-16  7:49 ` [RFC PATCH V7 09/10] drm/xe/configfs: Add vram bad page reservation policy Tejas Upadhyay
@ 2026-04-16  7:49 ` Tejas Upadhyay
  2026-04-16  7:56 ` ✗ CI.checkpatch: warning for Add memory page offlining support (rev8) Patchwork
  2026-04-16  7:57 ` ✗ CI.KUnit: failure " Patchwork
  11 siblings, 0 replies; 19+ messages in thread
From: Tejas Upadhyay @ 2026-04-16  7:49 UTC (permalink / raw)
  To: intel-xe
  Cc: matthew.auld, matthew.brost, thomas.hellstrom,
	himal.prasad.ghimiray, Tejas Upadhyay

Starting with CRI, include a sysfs interface designed to expose
information about bad VRAM pages, those identified as having hardware
faults (e.g., ECC errors). This interface allows userspace tools and
administrators to monitor the health of the GPU's local memory and
track the status of page retirement. Details on bad gpu vram pages
can be found under /sys/bus/pci/devices/bdf/vram_bad_pages.

The format is: pfn : gpu page size : flags

flags:
R: reserved, this gpu page is reserved.
P: pending for reserve, this gpu page is marked as bad, will be reserved
   in next window of page_reserve.
F: unable to reserve, this gpu page can't be reserved for some reason.

For example, reading with cat /sys/bus/pci/devices/bdf/vram_bad_pages gives:
max_pages : 10000
0x00000000 : 0x00001000 : R
0x00001234 : 0x00001000 : P
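
A sketch of how one such line can be produced (hypothetical helper; the
driver builds the same "pfn : size : flag" layout with scnprintf() while
walking the offlined-page list):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format one bad-page line: page frame number (addr >> 12), the GPU
 * page size, and a single status flag (R, P or F). */
static int format_bad_page_line(char *buf, size_t len,
				unsigned long addr, unsigned long size,
				char flag)
{
	return snprintf(buf, len, "0x%08lx : 0x%08lx : %c\n",
			addr >> 12, size, flag);
}
```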

v3:
- Move FW communication in RAS code
v2:
- Add max_pages info as per updated design doc
- Rebase

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_device_sysfs.c       |  7 ++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c       | 79 ++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.h       |  1 +
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |  2 +
 4 files changed, 89 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_sysfs.c b/drivers/gpu/drm/xe/xe_device_sysfs.c
index a73e0e957cb0..47c5be4180fe 100644
--- a/drivers/gpu/drm/xe/xe_device_sysfs.c
+++ b/drivers/gpu/drm/xe/xe_device_sysfs.c
@@ -8,12 +8,14 @@
 #include <linux/pci.h>
 #include <linux/sysfs.h>
 
+#include "xe_configfs.h"
 #include "xe_device.h"
 #include "xe_device_sysfs.h"
 #include "xe_mmio.h"
 #include "xe_pcode_api.h"
 #include "xe_pcode.h"
 #include "xe_pm.h"
+#include "xe_ttm_vram_mgr.h"
 
 /**
  * DOC: Xe device sysfs
@@ -267,6 +269,7 @@ static const struct attribute_group auto_link_downgrade_attr_group = {
 int xe_device_sysfs_init(struct xe_device *xe)
 {
 	struct device *dev = xe->drm.dev;
+	bool policy;
 	int ret;
 
 	if (xe->d3cold.capable) {
@@ -285,5 +288,9 @@ int xe_device_sysfs_init(struct xe_device *xe)
 			return ret;
 	}
 
+	policy = xe_configfs_get_bad_page_reservation(to_pci_dev(dev));
+	if (xe->info.platform == XE_CRESCENTISLAND && policy)
+		xe_ttm_vram_sysfs_init(xe);
+
 	return 0;
 }
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 7f58e7e8c3e1..611d945c9eb4 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -760,3 +760,82 @@ int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
 	return ret;
 }
 EXPORT_SYMBOL(xe_ttm_vram_handle_addr_fault);
+
+static ssize_t xe_ttm_vram_dump_bad_pages_info(char *buf, struct xe_ttm_vram_mgr *mgr)
+{
+	const unsigned int element_size = sizeof("0xabcdabcd : 0x12345678 : R\n") - 1;
+	const unsigned int maxpage_size = sizeof("max_pages: 10000\n") - 1;
+	struct xe_ttm_vram_offline_resource *pos;
+	struct gpu_buddy_block *block;
+	ssize_t s = 0;
+
+	mutex_lock(&mgr->lock);
+	s += scnprintf(&buf[s], maxpage_size + 1, "max_pages: %d\n", mgr->max_pages);
+	list_for_each_entry(pos, &mgr->offlined_pages, offlined_link) {
+		if (s + element_size >= PAGE_SIZE)
+			break;
+		block = list_first_entry(&pos->blocks,
+					 struct gpu_buddy_block,
+					 link);
+		s += scnprintf(&buf[s], element_size + 1,
+			       "0x%08llx : 0x%08llx : %1s\n",
+			       gpu_buddy_block_offset(block) >> PAGE_SHIFT,
+			       gpu_buddy_block_size(&mgr->mm, block),
+			       "R");
+	}
+	list_for_each_entry(pos, &mgr->queued_pages, queued_link) {
+		if (s + element_size >= PAGE_SIZE)
+			break;
+		block = list_first_entry(&pos->blocks,
+					 struct gpu_buddy_block,
+					 link);
+		s += scnprintf(&buf[s], element_size + 1,
+			       "0x%08llx : 0x%08llx : %1s\n",
+			       gpu_buddy_block_offset(block) >> PAGE_SHIFT,
+			       gpu_buddy_block_size(&mgr->mm, block),
+			       pos->status ? "P" : "F");
+	}
+	mutex_unlock(&mgr->lock);
+
+	return s;
+}
+
+static ssize_t vram_bad_pages_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct xe_device *xe = pdev_to_xe_device(pdev);
+	struct ttm_resource_manager *man;
+	struct xe_ttm_vram_mgr *mgr;
+	ssize_t len = 0;
+
+	man = ttm_manager_type(&xe->ttm, XE_PL_VRAM0);
+	if (man) {
+		mgr = to_xe_ttm_vram_mgr(man);
+		len = xe_ttm_vram_dump_bad_pages_info(buf, mgr);
+	}
+
+	return len;
+}
+static DEVICE_ATTR_RO(vram_bad_pages);
+
+static void xe_ttm_vram_sysfs_fini(void *arg)
+{
+	struct xe_device *xe = arg;
+
+	device_remove_file(xe->drm.dev, &dev_attr_vram_bad_pages);
+}
+
+/**
+ * xe_ttm_vram_sysfs_init - Initialize vram bad pages sysfs component
+ * @xe: Xe device object
+ *
+ * It needs to be initialized after the main device component is ready.
+ *
+ * Returns: 0 on success, negative error code on error.
+ */
+int xe_ttm_vram_sysfs_init(struct xe_device *xe)
+{
+	int err;
+
+	err = device_create_file(xe->drm.dev, &dev_attr_vram_bad_pages);
+	if (err) {
+		dev_err(xe->drm.dev, "Failed to create vram_bad_pages sysfs file: %d\n", err);
+		return err;
+	}
+
+	return devm_add_action_or_reset(xe->drm.dev, xe_ttm_vram_sysfs_fini, xe);
+}
+EXPORT_SYMBOL(xe_ttm_vram_sysfs_init);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
index 8ef06d9d44f7..c33e1a8d9217 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
@@ -32,6 +32,7 @@ void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
 			  u64 *used, u64 *used_visible);
 
 int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr);
+int xe_ttm_vram_sysfs_init(struct xe_device *xe);
 static inline struct xe_ttm_vram_mgr_resource *
 to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
 {
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index 07ed88b47e04..b23796066a1a 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ -39,6 +39,8 @@ struct xe_ttm_vram_mgr {
 	u32 mem_type;
 	/** @offline_mode: debugfs hook for setting page offline mode */
 	u64 offline_mode;
+	/** @max_pages: max pages that can be in offline queue retrieved from FW */
+	u16 max_pages;
 };
 
 /**
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* ✗ CI.checkpatch: warning for Add memory page offlining support (rev8)
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
                   ` (9 preceding siblings ...)
  2026-04-16  7:49 ` [RFC PATCH V7 10/10] drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
@ 2026-04-16  7:56 ` Patchwork
  2026-04-16  7:57 ` ✗ CI.KUnit: failure " Patchwork
  11 siblings, 0 replies; 19+ messages in thread
From: Patchwork @ 2026-04-16  7:56 UTC (permalink / raw)
  To: Tejas Upadhyay; +Cc: intel-xe

== Series Details ==

Series: Add memory page offlining support (rev8)
URL   : https://patchwork.freedesktop.org/series/161473/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
1f57ba1afceae32108bd24770069f764d940a0e4
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit cddb2434942e3dc976b828dca2b17d73fb7d6b68
Author: Tejas Upadhyay <tejas.upadhyay@intel.com>
Date:   Thu Apr 16 13:19:59 2026 +0530

    drm/xe/cri: Add sysfs interface for bad gpu vram pages
    
    Starting CRI, Include a sysfs interface designed to expose information
    about bad VRAM pages—those identified as having hardware faults
    (e.g., ECC errors). This interface allows userspace tools and
    administrators to monitor the health of the GPU's local memory and
    track the status of page retirement.To get details on bad gpu vram
    pages can be found under /sys/bus/pci/devices/bdf/vram_bad_pages.
    
    Where The format is, pfn : gpu page size : flags
    
    flags:
    R: reserved, this gpu page is reserved.
    P: pending for reserve, this gpu page is marked as bad, will be reserved
       in next window of page_reserve.
    F: unable to reserve. this gpu page can’t be reserved due to some reasons.
    
    For example if you read using cat /sys/bus/pci/devices/bdf/vram_bad_pages,
    max_pages : 10000
    0x00000000 : 0x00001000 : R
    0x00001234 : 0x00001000 : P
    
    v3:
    - Move FW communication in RAS code
    v2:
    - Add max_pages info as per updated design doc
    - Rebase
    
    Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
+ /mt/dim checkpatch 7e772296e93b337ca5e12fced2ac12716a159d28 drm-intel
7ad853ed2074 drm/xe: Link VRAM object with gpu buddy
9be10af07d2b gpu/buddy: Integrate lockdep for gpu buddy manager
cfbf17fee984 drm/gpu: Add gpu_buddy_allocated_addr_to_block helper
8e7bb33194ec drm/xe: Link LRC BO and its execution Queue
86665af845df drm/xe: Extend BO purge to handle vram pages as well
c6ca0861ed2a drm/xe: Handle physical memory address error
-:13: ERROR:BAD_COMMIT_SEPARATOR: Invalid commit separator - some tools may have problems applying this
#13: 
----------------------

-:20: ERROR:BAD_COMMIT_SEPARATOR: Invalid commit separator - some tools may have problems applying this
#20: 
-----------------

total: 2 errors, 0 warnings, 0 checks, 418 lines checked
0aa34b8cb1c5 drm/xe/cri: Add debugfs to inject faulty vram address
b73669f195ce gpu/buddy: Add routine to dump allocated buddy blocks
-:13: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#13: 
[  +0.000003] xe 0000:03:00.0: [drm] 0x00000002f8000000-0x00000002f8800000: 8388608

total: 0 errors, 1 warnings, 0 checks, 62 lines checked
a50fc40a0ce1 drm/xe/configfs: Add vram bad page reservation policy
cddb2434942e drm/xe/cri: Add sysfs interface for bad gpu vram pages
-:22: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#22: 
F: unable to reserve. this gpu page can’t be reserved due to some reasons.

total: 0 errors, 1 warnings, 0 checks, 127 lines checked



^ permalink raw reply	[flat|nested] 19+ messages in thread

* ✗ CI.KUnit: failure for Add memory page offlining support (rev8)
  2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
                   ` (10 preceding siblings ...)
  2026-04-16  7:56 ` ✗ CI.checkpatch: warning for Add memory page offlining support (rev8) Patchwork
@ 2026-04-16  7:57 ` Patchwork
  11 siblings, 0 replies; 19+ messages in thread
From: Patchwork @ 2026-04-16  7:57 UTC (permalink / raw)
  To: Tejas Upadhyay; +Cc: intel-xe

== Series Details ==

Series: Add memory page offlining support (rev8)
URL   : https://patchwork.freedesktop.org/series/161473/
State : failure

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
[07:56:07] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[07:56:11] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[07:56:42] Starting KUnit Kernel (1/1)...
[07:56:42] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[07:56:42] ================== guc_buf (11 subtests) ===================
[07:56:42] [PASSED] test_smallest
[07:56:42] [PASSED] test_largest
[07:56:42] [PASSED] test_granular
[07:56:42] [PASSED] test_unique
[07:56:42] [PASSED] test_overlap
[07:56:42] [PASSED] test_reusable
[07:56:42] [PASSED] test_too_big
[07:56:42] [PASSED] test_flush
[07:56:42] [PASSED] test_lookup
[07:56:42] [PASSED] test_data
[07:56:42] [PASSED] test_class
[07:56:42] ===================== [PASSED] guc_buf =====================
[07:56:42] =================== guc_dbm (7 subtests) ===================
[07:56:42] [PASSED] test_empty
[07:56:42] [PASSED] test_default
[07:56:42] ======================== test_size  ========================
[07:56:42] [PASSED] 4
[07:56:42] [PASSED] 8
[07:56:42] [PASSED] 32
[07:56:42] [PASSED] 256
[07:56:42] ==================== [PASSED] test_size ====================
[07:56:42] ======================= test_reuse  ========================
[07:56:42] [PASSED] 4
[07:56:42] [PASSED] 8
[07:56:42] [PASSED] 32
[07:56:42] [PASSED] 256
[07:56:42] =================== [PASSED] test_reuse ====================
[07:56:42] =================== test_range_overlap  ====================
[07:56:42] [PASSED] 4
[07:56:42] [PASSED] 8
[07:56:42] [PASSED] 32
[07:56:42] [PASSED] 256
[07:56:42] =============== [PASSED] test_range_overlap ================
[07:56:42] =================== test_range_compact  ====================
[07:56:42] [PASSED] 4
[07:56:42] [PASSED] 8
[07:56:42] [PASSED] 32
[07:56:42] [PASSED] 256
[07:56:42] =============== [PASSED] test_range_compact ================
[07:56:42] ==================== test_range_spare  =====================
[07:56:42] [PASSED] 4
[07:56:42] [PASSED] 8
[07:56:42] [PASSED] 32
[07:56:42] [PASSED] 256
[07:56:42] ================ [PASSED] test_range_spare =================
[07:56:42] ===================== [PASSED] guc_dbm =====================
[07:56:42] =================== guc_idm (6 subtests) ===================
[07:56:42] [PASSED] bad_init
[07:56:42] [PASSED] no_init
[07:56:42] [PASSED] init_fini
[07:56:42] [PASSED] check_used
[07:56:42] [PASSED] check_quota
[07:56:42] [PASSED] check_all
[07:56:42] ===================== [PASSED] guc_idm =====================
[07:56:42] ================== no_relay (3 subtests) ===================
[07:56:42] [PASSED] xe_drops_guc2pf_if_not_ready
[07:56:42] [PASSED] xe_drops_guc2vf_if_not_ready
[07:56:42] [PASSED] xe_rejects_send_if_not_ready
[07:56:42] ==================== [PASSED] no_relay =====================
[07:56:42] ================== pf_relay (14 subtests) ==================
[07:56:42] [PASSED] pf_rejects_guc2pf_too_short
[07:56:42] [PASSED] pf_rejects_guc2pf_too_long
[07:56:42] [PASSED] pf_rejects_guc2pf_no_payload
[07:56:42] [PASSED] pf_fails_no_payload
[07:56:42] [PASSED] pf_fails_bad_origin
[07:56:42] [PASSED] pf_fails_bad_type
[07:56:42] [PASSED] pf_txn_reports_error
[07:56:42] [PASSED] pf_txn_sends_pf2guc
[07:56:42] [PASSED] pf_sends_pf2guc
[07:56:42] [SKIPPED] pf_loopback_nop
[07:56:42] [SKIPPED] pf_loopback_echo
[07:56:42] [SKIPPED] pf_loopback_fail
[07:56:42] [SKIPPED] pf_loopback_busy
[07:56:42] [SKIPPED] pf_loopback_retry
[07:56:42] ==================== [PASSED] pf_relay =====================
[07:56:42] ================== vf_relay (3 subtests) ===================
[07:56:42] [PASSED] vf_rejects_guc2vf_too_short
[07:56:42] [PASSED] vf_rejects_guc2vf_too_long
[07:56:42] [PASSED] vf_rejects_guc2vf_no_payload
[07:56:42] ==================== [PASSED] vf_relay =====================
[07:56:42] ================ pf_gt_config (9 subtests) =================
[07:56:42] [PASSED] fair_contexts_1vf
[07:56:42] [PASSED] fair_doorbells_1vf
[07:56:42] [PASSED] fair_ggtt_1vf
[07:56:42] ====================== fair_vram_1vf  ======================
[07:56:42] [PASSED] 3.50 GiB
[07:56:42] [PASSED] 11.5 GiB
[07:56:42] [PASSED] 15.5 GiB
[07:56:42] [PASSED] 31.5 GiB
[07:56:42] [PASSED] 63.5 GiB
[07:56:42] [PASSED] 1.91 GiB
[07:56:42] ================== [PASSED] fair_vram_1vf ==================
[07:56:42] ================ fair_vram_1vf_admin_only  =================
[07:56:42] [PASSED] 3.50 GiB
[07:56:42] [PASSED] 11.5 GiB
[07:56:42] [PASSED] 15.5 GiB
[07:56:42] [PASSED] 31.5 GiB
[07:56:42] [PASSED] 63.5 GiB
[07:56:42] [PASSED] 1.91 GiB
[07:56:42] ============ [PASSED] fair_vram_1vf_admin_only =============
[07:56:42] ====================== fair_contexts  ======================
[07:56:42] [PASSED] 1 VF
[07:56:42] [PASSED] 2 VFs
[07:56:42] [PASSED] 3 VFs
[07:56:42] [PASSED] 4 VFs
[07:56:42] [PASSED] 5 VFs
[07:56:42] [PASSED] 6 VFs
[07:56:42] [PASSED] 7 VFs
[07:56:42] [PASSED] 8 VFs
[07:56:42] [PASSED] 9 VFs
[07:56:42] [PASSED] 10 VFs
[07:56:42] [PASSED] 11 VFs
[07:56:42] [PASSED] 12 VFs
[07:56:42] [PASSED] 13 VFs
[07:56:42] [PASSED] 14 VFs
[07:56:42] [PASSED] 15 VFs
[07:56:42] [PASSED] 16 VFs
[07:56:42] [PASSED] 17 VFs
[07:56:42] [PASSED] 18 VFs
[07:56:42] [PASSED] 19 VFs
[07:56:42] [PASSED] 20 VFs
[07:56:42] [PASSED] 21 VFs
[07:56:42] [PASSED] 22 VFs
[07:56:42] [PASSED] 23 VFs
[07:56:42] [PASSED] 24 VFs
[07:56:42] [PASSED] 25 VFs
[07:56:42] [PASSED] 26 VFs
[07:56:42] [PASSED] 27 VFs
[07:56:42] [PASSED] 28 VFs
[07:56:42] [PASSED] 29 VFs
[07:56:42] [PASSED] 30 VFs
[07:56:42] [PASSED] 31 VFs
[07:56:42] [PASSED] 32 VFs
[07:56:42] [PASSED] 33 VFs
[07:56:42] [PASSED] 34 VFs
[07:56:42] [PASSED] 35 VFs
[07:56:42] [PASSED] 36 VFs
[07:56:42] [PASSED] 37 VFs
[07:56:42] [PASSED] 38 VFs
[07:56:42] [PASSED] 39 VFs
[07:56:42] [PASSED] 40 VFs
[07:56:42] [PASSED] 41 VFs
[07:56:42] [PASSED] 42 VFs
[07:56:42] [PASSED] 43 VFs
[07:56:42] [PASSED] 44 VFs
[07:56:42] [PASSED] 45 VFs
[07:56:42] [PASSED] 46 VFs
[07:56:42] [PASSED] 47 VFs
[07:56:42] [PASSED] 48 VFs
[07:56:42] [PASSED] 49 VFs
[07:56:42] [PASSED] 50 VFs
[07:56:42] [PASSED] 51 VFs
[07:56:42] [PASSED] 52 VFs
[07:56:42] [PASSED] 53 VFs
[07:56:42] [PASSED] 54 VFs
[07:56:42] [PASSED] 55 VFs
[07:56:42] [PASSED] 56 VFs
[07:56:42] [PASSED] 57 VFs
[07:56:42] [PASSED] 58 VFs
[07:56:42] [PASSED] 59 VFs
[07:56:42] [PASSED] 60 VFs
[07:56:42] [PASSED] 61 VFs
[07:56:42] [PASSED] 62 VFs
[07:56:42] [PASSED] 63 VFs
[07:56:42] ================== [PASSED] fair_contexts ==================
[07:56:42] ===================== fair_doorbells  ======================
[07:56:42] [PASSED] 1 VF
[07:56:42] [PASSED] 2 VFs
[07:56:42] [PASSED] 3 VFs
[07:56:42] [PASSED] 4 VFs
[07:56:42] [PASSED] 5 VFs
[07:56:42] [PASSED] 6 VFs
[07:56:42] [PASSED] 7 VFs
[07:56:42] [PASSED] 8 VFs
[07:56:42] [PASSED] 9 VFs
[07:56:42] [PASSED] 10 VFs
[07:56:42] [PASSED] 11 VFs
[07:56:42] [PASSED] 12 VFs
[07:56:42] [PASSED] 13 VFs
[07:56:42] [PASSED] 14 VFs
[07:56:42] [PASSED] 15 VFs
[07:56:42] [PASSED] 16 VFs
[07:56:42] [PASSED] 17 VFs
[07:56:42] [PASSED] 18 VFs
[07:56:42] [PASSED] 19 VFs
[07:56:42] [PASSED] 20 VFs
[07:56:42] [PASSED] 21 VFs
[07:56:42] [PASSED] 22 VFs
[07:56:42] [PASSED] 23 VFs
[07:56:42] [PASSED] 24 VFs
[07:56:42] [PASSED] 25 VFs
[07:56:42] [PASSED] 26 VFs
[07:56:42] [PASSED] 27 VFs
[07:56:42] [PASSED] 28 VFs
[07:56:42] [PASSED] 29 VFs
[07:56:42] [PASSED] 30 VFs
[07:56:42] [PASSED] 31 VFs
[07:56:42] [PASSED] 32 VFs
[07:56:42] [PASSED] 33 VFs
[07:56:42] [PASSED] 34 VFs
[07:56:42] [PASSED] 35 VFs
[07:56:42] [PASSED] 36 VFs
[07:56:42] [PASSED] 37 VFs
[07:56:42] [PASSED] 38 VFs
[07:56:42] [PASSED] 39 VFs
[07:56:42] [PASSED] 40 VFs
[07:56:42] [PASSED] 41 VFs
[07:56:42] [PASSED] 42 VFs
[07:56:42] [PASSED] 43 VFs
[07:56:42] [PASSED] 44 VFs
[07:56:42] [PASSED] 45 VFs
[07:56:42] [PASSED] 46 VFs
[07:56:42] [PASSED] 47 VFs
[07:56:42] [PASSED] 48 VFs
[07:56:42] [PASSED] 49 VFs
[07:56:42] [PASSED] 50 VFs
[07:56:42] [PASSED] 51 VFs
[07:56:42] [PASSED] 52 VFs
[07:56:42] [PASSED] 53 VFs
[07:56:42] [PASSED] 54 VFs
[07:56:42] [PASSED] 55 VFs
[07:56:42] [PASSED] 56 VFs
[07:56:42] [PASSED] 57 VFs
[07:56:42] [PASSED] 58 VFs
[07:56:42] [PASSED] 59 VFs
[07:56:42] [PASSED] 60 VFs
[07:56:42] [PASSED] 61 VFs
[07:56:42] [PASSED] 62 VFs
[07:56:42] [PASSED] 63 VFs
[07:56:42] ================= [PASSED] fair_doorbells ==================
[07:56:42] ======================== fair_ggtt  ========================
[07:56:42] [PASSED] 1 VF
[07:56:42] [PASSED] 2 VFs
[07:56:42] [PASSED] 3 VFs
[07:56:42] [PASSED] 4 VFs
[07:56:42] [PASSED] 5 VFs
[07:56:42] [PASSED] 6 VFs
[07:56:42] [PASSED] 7 VFs
[07:56:42] [PASSED] 8 VFs
[07:56:42] [PASSED] 9 VFs
[07:56:42] [PASSED] 10 VFs
[07:56:42] [PASSED] 11 VFs
[07:56:42] [PASSED] 12 VFs
[07:56:42] [PASSED] 13 VFs
[07:56:42] [PASSED] 14 VFs
[07:56:42] [PASSED] 15 VFs
[07:56:42] [PASSED] 16 VFs
[07:56:42] [PASSED] 17 VFs
[07:56:42] [PASSED] 18 VFs
[07:56:42] [PASSED] 19 VFs
[07:56:42] [PASSED] 20 VFs
[07:56:42] [PASSED] 21 VFs
[07:56:42] [PASSED] 22 VFs
[07:56:42] [PASSED] 23 VFs
[07:56:42] [PASSED] 24 VFs
[07:56:42] [PASSED] 25 VFs
[07:56:42] [PASSED] 26 VFs
[07:56:42] [PASSED] 27 VFs
[07:56:42] [PASSED] 28 VFs
[07:56:42] [PASSED] 29 VFs
[07:56:42] [PASSED] 30 VFs
[07:56:42] [PASSED] 31 VFs
[07:56:42] [PASSED] 32 VFs
[07:56:42] [PASSED] 33 VFs
[07:56:42] [PASSED] 34 VFs
[07:56:42] [PASSED] 35 VFs
[07:56:42] [PASSED] 36 VFs
[07:56:42] [PASSED] 37 VFs
[07:56:42] [PASSED] 38 VFs
[07:56:42] [PASSED] 39 VFs
[07:56:42] [PASSED] 40 VFs
[07:56:42] [PASSED] 41 VFs
[07:56:42] [PASSED] 42 VFs
[07:56:42] [PASSED] 43 VFs
[07:56:42] [PASSED] 44 VFs
[07:56:42] [PASSED] 45 VFs
[07:56:42] [PASSED] 46 VFs
[07:56:42] [PASSED] 47 VFs
[07:56:42] [PASSED] 48 VFs
[07:56:42] [PASSED] 49 VFs
[07:56:42] [PASSED] 50 VFs
[07:56:42] [PASSED] 51 VFs
[07:56:42] [PASSED] 52 VFs
[07:56:42] [PASSED] 53 VFs
[07:56:42] [PASSED] 54 VFs
[07:56:42] [PASSED] 55 VFs
[07:56:42] [PASSED] 56 VFs
[07:56:42] [PASSED] 57 VFs
[07:56:42] [PASSED] 58 VFs
[07:56:42] [PASSED] 59 VFs
[07:56:42] [PASSED] 60 VFs
[07:56:42] [PASSED] 61 VFs
[07:56:42] [PASSED] 62 VFs
[07:56:42] [PASSED] 63 VFs
[07:56:42] ==================== [PASSED] fair_ggtt ====================
[07:56:42] ======================== fair_vram  ========================
[07:56:42] [PASSED] 1 VF
[07:56:42] [PASSED] 2 VFs
[07:56:42] [PASSED] 3 VFs
[07:56:42] [PASSED] 4 VFs
[07:56:42] [PASSED] 5 VFs
[07:56:42] [PASSED] 6 VFs
[07:56:42] [PASSED] 7 VFs
[07:56:42] [PASSED] 8 VFs
[07:56:42] [PASSED] 9 VFs
[07:56:42] [PASSED] 10 VFs
[07:56:42] [PASSED] 11 VFs
[07:56:42] [PASSED] 12 VFs
[07:56:42] [PASSED] 13 VFs
[07:56:42] [PASSED] 14 VFs
[07:56:42] [PASSED] 15 VFs
[07:56:42] [PASSED] 16 VFs
[07:56:42] [PASSED] 17 VFs
[07:56:42] [PASSED] 18 VFs
[07:56:42] [PASSED] 19 VFs
[07:56:42] [PASSED] 20 VFs
[07:56:42] [PASSED] 21 VFs
[07:56:42] [PASSED] 22 VFs
[07:56:42] [PASSED] 23 VFs
[07:56:42] [PASSED] 24 VFs
[07:56:42] [PASSED] 25 VFs
[07:56:42] [PASSED] 26 VFs
[07:56:42] [PASSED] 27 VFs
[07:56:42] [PASSED] 28 VFs
[07:56:42] [PASSED] 29 VFs
[07:56:42] [PASSED] 30 VFs
[07:56:42] [PASSED] 31 VFs
[07:56:42] [PASSED] 32 VFs
[07:56:42] [PASSED] 33 VFs
[07:56:42] [PASSED] 34 VFs
[07:56:42] [PASSED] 35 VFs
[07:56:42] [PASSED] 36 VFs
[07:56:42] [PASSED] 37 VFs
[07:56:42] [PASSED] 38 VFs
[07:56:42] [PASSED] 39 VFs
[07:56:42] [PASSED] 40 VFs
[07:56:42] [PASSED] 41 VFs
[07:56:42] [PASSED] 42 VFs
[07:56:42] [PASSED] 43 VFs
[07:56:42] [PASSED] 44 VFs
[07:56:42] [PASSED] 45 VFs
[07:56:42] [PASSED] 46 VFs
[07:56:42] [PASSED] 47 VFs
[07:56:42] [PASSED] 48 VFs
[07:56:42] [PASSED] 49 VFs
[07:56:42] [PASSED] 50 VFs
[07:56:42] [PASSED] 51 VFs
[07:56:42] [PASSED] 52 VFs
[07:56:42] [PASSED] 53 VFs
[07:56:43] [PASSED] 54 VFs
[07:56:43] [PASSED] 55 VFs
[07:56:43] [PASSED] 56 VFs
[07:56:43] [PASSED] 57 VFs
[07:56:43] [PASSED] 58 VFs
[07:56:43] [PASSED] 59 VFs
[07:56:43] [PASSED] 60 VFs
[07:56:43] [PASSED] 61 VFs
[07:56:43] [PASSED] 62 VFs
[07:56:43] [PASSED] 63 VFs
[07:56:43] ==================== [PASSED] fair_vram ====================
[07:56:43] ================== [PASSED] pf_gt_config ===================
[07:56:43] ===================== lmtt (1 subtest) =====================
[07:56:43] ======================== test_ops  =========================
[07:56:43] [PASSED] 2-level
[07:56:43] [PASSED] multi-level
[07:56:43] ==================== [PASSED] test_ops =====================
[07:56:43] ====================== [PASSED] lmtt =======================
[07:56:43] ================= pf_service (11 subtests) =================
[07:56:43] [PASSED] pf_negotiate_any
[07:56:43] [PASSED] pf_negotiate_base_match
[07:56:43] [PASSED] pf_negotiate_base_newer
[07:56:43] [PASSED] pf_negotiate_base_next
[07:56:43] [SKIPPED] pf_negotiate_base_older
[07:56:43] [PASSED] pf_negotiate_base_prev
[07:56:43] [PASSED] pf_negotiate_latest_match
[07:56:43] [PASSED] pf_negotiate_latest_newer
[07:56:43] [PASSED] pf_negotiate_latest_next
[07:56:43] [SKIPPED] pf_negotiate_latest_older
[07:56:43] [SKIPPED] pf_negotiate_latest_prev
[07:56:43] =================== [PASSED] pf_service ====================
[07:56:43] ================= xe_guc_g2g (2 subtests) ==================
[07:56:43] ============== xe_live_guc_g2g_kunit_default  ==============
[07:56:43] ========= [SKIPPED] xe_live_guc_g2g_kunit_default ==========
[07:56:43] ============== xe_live_guc_g2g_kunit_allmem  ===============
[07:56:43] ========== [SKIPPED] xe_live_guc_g2g_kunit_allmem ==========
[07:56:43] =================== [SKIPPED] xe_guc_g2g ===================
[07:56:43] =================== xe_mocs (2 subtests) ===================
[07:56:43] ================ xe_live_mocs_kernel_kunit  ================
[07:56:43] =========== [SKIPPED] xe_live_mocs_kernel_kunit ============
[07:56:43] ================ xe_live_mocs_reset_kunit  =================
[07:56:43] ============ [SKIPPED] xe_live_mocs_reset_kunit ============
[07:56:43] ==================== [SKIPPED] xe_mocs =====================
[07:56:43] ================= xe_migrate (2 subtests) ==================
[07:56:43] ================= xe_migrate_sanity_kunit  =================
[07:56:43] ============ [SKIPPED] xe_migrate_sanity_kunit =============
[07:56:43] ================== xe_validate_ccs_kunit  ==================
[07:56:43] ============= [SKIPPED] xe_validate_ccs_kunit ==============
[07:56:43] =================== [SKIPPED] xe_migrate ===================
[07:56:43] ================== xe_dma_buf (1 subtest) ==================
[07:56:43] ==================== xe_dma_buf_kunit  =====================
[07:56:43] ================ [SKIPPED] xe_dma_buf_kunit ================
[07:56:43] =================== [SKIPPED] xe_dma_buf ===================
[07:56:43] ================= xe_bo_shrink (1 subtest) =================
[07:56:43] =================== xe_bo_shrink_kunit  ====================
[07:56:43] =============== [SKIPPED] xe_bo_shrink_kunit ===============
[07:56:43] ================== [SKIPPED] xe_bo_shrink ==================
[07:56:43] ==================== xe_bo (2 subtests) ====================
[07:56:43] ================== xe_ccs_migrate_kunit  ===================
[07:56:43] ============== [SKIPPED] xe_ccs_migrate_kunit ==============
[07:56:43] ==================== xe_bo_evict_kunit  ====================
[07:56:43] =============== [SKIPPED] xe_bo_evict_kunit ================
[07:56:43] ===================== [SKIPPED] xe_bo ======================
[07:56:43] ==================== args (13 subtests) ====================
[07:56:43] [PASSED] count_args_test
[07:56:43] [PASSED] call_args_example
[07:56:43] [PASSED] call_args_test
[07:56:43] [PASSED] drop_first_arg_example
[07:56:43] [PASSED] drop_first_arg_test
[07:56:43] [PASSED] first_arg_example
[07:56:43] [PASSED] first_arg_test
[07:56:43] [PASSED] last_arg_example
[07:56:43] [PASSED] last_arg_test
[07:56:43] [PASSED] pick_arg_example
[07:56:43] [PASSED] if_args_example
[07:56:43] [PASSED] if_args_test
[07:56:43] [PASSED] sep_comma_example
[07:56:43] ====================== [PASSED] args =======================
[07:56:43] =================== xe_pci (3 subtests) ====================
[07:56:43] ==================== check_graphics_ip  ====================
[07:56:43] [PASSED] 12.00 Xe_LP
[07:56:43] [PASSED] 12.10 Xe_LP+
[07:56:43] [PASSED] 12.55 Xe_HPG
[07:56:43] [PASSED] 12.60 Xe_HPC
[07:56:43] [PASSED] 12.70 Xe_LPG
[07:56:43] [PASSED] 12.71 Xe_LPG
[07:56:43] [PASSED] 12.74 Xe_LPG+
[07:56:43] [PASSED] 20.01 Xe2_HPG
[07:56:43] [PASSED] 20.02 Xe2_HPG
[07:56:43] [PASSED] 20.04 Xe2_LPG
[07:56:43] [PASSED] 30.00 Xe3_LPG
[07:56:43] [PASSED] 30.01 Xe3_LPG
[07:56:43] [PASSED] 30.03 Xe3_LPG
[07:56:43] [PASSED] 30.04 Xe3_LPG
[07:56:43] [PASSED] 30.05 Xe3_LPG
[07:56:43] [PASSED] 35.10 Xe3p_LPG
[07:56:43] [PASSED] 35.11 Xe3p_XPC
[07:56:43] ================ [PASSED] check_graphics_ip ================
[07:56:43] ===================== check_media_ip  ======================
[07:56:43] [PASSED] 12.00 Xe_M
[07:56:43] [PASSED] 12.55 Xe_HPM
[07:56:43] [PASSED] 13.00 Xe_LPM+
[07:56:43] [PASSED] 13.01 Xe2_HPM
[07:56:43] [PASSED] 20.00 Xe2_LPM
[07:56:43] [PASSED] 30.00 Xe3_LPM
[07:56:43] [PASSED] 30.02 Xe3_LPM
[07:56:43] [PASSED] 35.00 Xe3p_LPM
[07:56:43] [PASSED] 35.03 Xe3p_HPM
[07:56:43] ================= [PASSED] check_media_ip ==================
[07:56:43] =================== check_platform_desc  ===================
[07:56:43] [PASSED] 0x9A60 (TIGERLAKE)
[07:56:43] [PASSED] 0x9A68 (TIGERLAKE)
[07:56:43] [PASSED] 0x9A70 (TIGERLAKE)
[07:56:43] [PASSED] 0x9A40 (TIGERLAKE)
[07:56:43] [PASSED] 0x9A49 (TIGERLAKE)
[07:56:43] [PASSED] 0x9A59 (TIGERLAKE)
[07:56:43] [PASSED] 0x9A78 (TIGERLAKE)
[07:56:43] [PASSED] 0x9AC0 (TIGERLAKE)
[07:56:43] [PASSED] 0x9AC9 (TIGERLAKE)
[07:56:43] [PASSED] 0x9AD9 (TIGERLAKE)
[07:56:43] [PASSED] 0x9AF8 (TIGERLAKE)
[07:56:43] [PASSED] 0x4C80 (ROCKETLAKE)
[07:56:43] [PASSED] 0x4C8A (ROCKETLAKE)
[07:56:43] [PASSED] 0x4C8B (ROCKETLAKE)
[07:56:43] [PASSED] 0x4C8C (ROCKETLAKE)
[07:56:43] [PASSED] 0x4C90 (ROCKETLAKE)
[07:56:43] [PASSED] 0x4C9A (ROCKETLAKE)
[07:56:43] [PASSED] 0x4680 (ALDERLAKE_S)
[07:56:43] [PASSED] 0x4682 (ALDERLAKE_S)
[07:56:43] [PASSED] 0x4688 (ALDERLAKE_S)
[07:56:43] [PASSED] 0x468A (ALDERLAKE_S)
[07:56:43] [PASSED] 0x468B (ALDERLAKE_S)
[07:56:43] [PASSED] 0x4690 (ALDERLAKE_S)
[07:56:43] [PASSED] 0x4692 (ALDERLAKE_S)
[07:56:43] [PASSED] 0x4693 (ALDERLAKE_S)
[07:56:43] [PASSED] 0x46A0 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46A1 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46A2 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46A3 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46A6 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46A8 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46AA (ALDERLAKE_P)
[07:56:43] [PASSED] 0x462A (ALDERLAKE_P)
[07:56:43] [PASSED] 0x4626 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x4628 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46B0 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46B1 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46B2 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46B3 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46C0 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46C1 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46C2 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46C3 (ALDERLAKE_P)
[07:56:43] [PASSED] 0x46D0 (ALDERLAKE_N)
[07:56:43] [PASSED] 0x46D1 (ALDERLAKE_N)
[07:56:43] [PASSED] 0x46D2 (ALDERLAKE_N)
[07:56:43] [PASSED] 0x46D3 (ALDERLAKE_N)
[07:56:43] [PASSED] 0x46D4 (ALDERLAKE_N)
[07:56:43] [PASSED] 0xA721 (ALDERLAKE_P)
[07:56:43] [PASSED] 0xA7A1 (ALDERLAKE_P)
[07:56:43] [PASSED] 0xA7A9 (ALDERLAKE_P)
[07:56:43] [PASSED] 0xA7AC (ALDERLAKE_P)
[07:56:43] [PASSED] 0xA7AD (ALDERLAKE_P)
[07:56:43] [PASSED] 0xA720 (ALDERLAKE_P)
[07:56:43] [PASSED] 0xA7A0 (ALDERLAKE_P)
[07:56:43] [PASSED] 0xA7A8 (ALDERLAKE_P)
[07:56:43] [PASSED] 0xA7AA (ALDERLAKE_P)
[07:56:43] [PASSED] 0xA7AB (ALDERLAKE_P)
[07:56:43] [PASSED] 0xA780 (ALDERLAKE_S)
[07:56:43] [PASSED] 0xA781 (ALDERLAKE_S)
[07:56:43] [PASSED] 0xA782 (ALDERLAKE_S)
[07:56:43] [PASSED] 0xA783 (ALDERLAKE_S)
[07:56:43] [PASSED] 0xA788 (ALDERLAKE_S)
[07:56:43] [PASSED] 0xA789 (ALDERLAKE_S)
[07:56:43] [PASSED] 0xA78A (ALDERLAKE_S)
[07:56:43] [PASSED] 0xA78B (ALDERLAKE_S)
[07:56:43] [PASSED] 0x4905 (DG1)
[07:56:43] [PASSED] 0x4906 (DG1)
[07:56:43] [PASSED] 0x4907 (DG1)
[07:56:43] [PASSED] 0x4908 (DG1)
[07:56:43] [PASSED] 0x4909 (DG1)
[07:56:43] [PASSED] 0x56C0 (DG2)
[07:56:43] [PASSED] 0x56C2 (DG2)
[07:56:43] [PASSED] 0x56C1 (DG2)
[07:56:43] [PASSED] 0x7D51 (METEORLAKE)
[07:56:43] [PASSED] 0x7DD1 (METEORLAKE)
[07:56:43] [PASSED] 0x7D41 (METEORLAKE)
[07:56:43] [PASSED] 0x7D67 (METEORLAKE)
[07:56:43] [PASSED] 0xB640 (METEORLAKE)
[07:56:43] [PASSED] 0x56A0 (DG2)
[07:56:43] [PASSED] 0x56A1 (DG2)
[07:56:43] [PASSED] 0x56A2 (DG2)
[07:56:43] [PASSED] 0x56BE (DG2)
[07:56:43] [PASSED] 0x56BF (DG2)
[07:56:43] [PASSED] 0x5690 (DG2)
[07:56:43] [PASSED] 0x5691 (DG2)
[07:56:43] [PASSED] 0x5692 (DG2)
[07:56:43] [PASSED] 0x56A5 (DG2)
[07:56:43] [PASSED] 0x56A6 (DG2)
[07:56:43] [PASSED] 0x56B0 (DG2)
[07:56:43] [PASSED] 0x56B1 (DG2)
[07:56:43] [PASSED] 0x56BA (DG2)
[07:56:43] [PASSED] 0x56BB (DG2)
[07:56:43] [PASSED] 0x56BC (DG2)
[07:56:43] [PASSED] 0x56BD (DG2)
[07:56:43] [PASSED] 0x5693 (DG2)
[07:56:43] [PASSED] 0x5694 (DG2)
[07:56:43] [PASSED] 0x5695 (DG2)
[07:56:43] [PASSED] 0x56A3 (DG2)
[07:56:43] [PASSED] 0x56A4 (DG2)
[07:56:43] [PASSED] 0x56B2 (DG2)
[07:56:43] [PASSED] 0x56B3 (DG2)
[07:56:43] [PASSED] 0x5696 (DG2)
[07:56:43] [PASSED] 0x5697 (DG2)
[07:56:43] [PASSED] 0xB69 (PVC)
[07:56:43] [PASSED] 0xB6E (PVC)
[07:56:43] [PASSED] 0xBD4 (PVC)
[07:56:43] [PASSED] 0xBD5 (PVC)
[07:56:43] [PASSED] 0xBD6 (PVC)
[07:56:43] [PASSED] 0xBD7 (PVC)
[07:56:43] [PASSED] 0xBD8 (PVC)
[07:56:43] [PASSED] 0xBD9 (PVC)
[07:56:43] [PASSED] 0xBDA (PVC)
[07:56:43] [PASSED] 0xBDB (PVC)
[07:56:43] [PASSED] 0xBE0 (PVC)
[07:56:43] [PASSED] 0xBE1 (PVC)
[07:56:43] [PASSED] 0xBE5 (PVC)
[07:56:43] [PASSED] 0x7D40 (METEORLAKE)
[07:56:43] [PASSED] 0x7D45 (METEORLAKE)
[07:56:43] [PASSED] 0x7D55 (METEORLAKE)
[07:56:43] [PASSED] 0x7D60 (METEORLAKE)
[07:56:43] [PASSED] 0x7DD5 (METEORLAKE)
[07:56:43] [PASSED] 0x6420 (LUNARLAKE)
[07:56:43] [PASSED] 0x64A0 (LUNARLAKE)
[07:56:43] [PASSED] 0x64B0 (LUNARLAKE)
[07:56:43] [PASSED] 0xE202 (BATTLEMAGE)
[07:56:43] [PASSED] 0xE209 (BATTLEMAGE)
[07:56:43] [PASSED] 0xE20B (BATTLEMAGE)
[07:56:43] [PASSED] 0xE20C (BATTLEMAGE)
[07:56:43] [PASSED] 0xE20D (BATTLEMAGE)
[07:56:43] [PASSED] 0xE210 (BATTLEMAGE)
[07:56:43] [PASSED] 0xE211 (BATTLEMAGE)
[07:56:43] [PASSED] 0xE212 (BATTLEMAGE)
[07:56:43] [PASSED] 0xE216 (BATTLEMAGE)
[07:56:43] [PASSED] 0xE220 (BATTLEMAGE)
[07:56:43] [PASSED] 0xE221 (BATTLEMAGE)
[07:56:43] [PASSED] 0xE222 (BATTLEMAGE)
[07:56:43] [PASSED] 0xE223 (BATTLEMAGE)
[07:56:43] [PASSED] 0xB080 (PANTHERLAKE)
[07:56:43] [PASSED] 0xB081 (PANTHERLAKE)
[07:56:43] [PASSED] 0xB082 (PANTHERLAKE)
[07:56:43] [PASSED] 0xB083 (PANTHERLAKE)
[07:56:43] [PASSED] 0xB084 (PANTHERLAKE)
[07:56:43] [PASSED] 0xB085 (PANTHERLAKE)
[07:56:43] [PASSED] 0xB086 (PANTHERLAKE)
[07:56:43] [PASSED] 0xB087 (PANTHERLAKE)
[07:56:43] [PASSED] 0xB08F (PANTHERLAKE)
[07:56:43] [PASSED] 0xB090 (PANTHERLAKE)
[07:56:43] [PASSED] 0xB0A0 (PANTHERLAKE)
[07:56:43] [PASSED] 0xB0B0 (PANTHERLAKE)
[07:56:43] [PASSED] 0xFD80 (PANTHERLAKE)
[07:56:43] [PASSED] 0xFD81 (PANTHERLAKE)
[07:56:43] [PASSED] 0xD740 (NOVALAKE_S)
[07:56:43] [PASSED] 0xD741 (NOVALAKE_S)
[07:56:43] [PASSED] 0xD742 (NOVALAKE_S)
[07:56:43] [PASSED] 0xD743 (NOVALAKE_S)
[07:56:43] [PASSED] 0xD744 (NOVALAKE_S)
[07:56:43] [PASSED] 0xD745 (NOVALAKE_S)
[07:56:43] [PASSED] 0x674C (CRESCENTISLAND)
[07:56:43] [PASSED] 0xD750 (NOVALAKE_P)
[07:56:43] [PASSED] 0xD751 (NOVALAKE_P)
[07:56:43] [PASSED] 0xD752 (NOVALAKE_P)
[07:56:43] [PASSED] 0xD753 (NOVALAKE_P)
[07:56:43] [PASSED] 0xD754 (NOVALAKE_P)
[07:56:43] [PASSED] 0xD755 (NOVALAKE_P)
[07:56:43] [PASSED] 0xD756 (NOVALAKE_P)
[07:56:43] [PASSED] 0xD757 (NOVALAKE_P)
[07:56:43] [PASSED] 0xD75F (NOVALAKE_P)
[07:56:43] =============== [PASSED] check_platform_desc ===============
[07:56:43] ===================== [PASSED] xe_pci ======================
[07:56:43] =================== xe_rtp (2 subtests) ====================
[07:56:43] =============== xe_rtp_process_to_sr_tests  ================
[07:56:43] [PASSED] coalesce-same-reg
[07:56:43] [PASSED] no-match-no-add
[07:56:43] [PASSED] match-or
[07:56:43] [PASSED] match-or-xfail
[07:56:43] [PASSED] no-match-no-add-multiple-rules
[07:56:43] [PASSED] two-regs-two-entries
[07:56:43] [PASSED] clr-one-set-other
[07:56:43] [PASSED] set-field
[07:56:43] [PASSED] conflict-duplicate
[07:56:43] [PASSED] conflict-not-disjoint
[07:56:43] [PASSED] conflict-reg-type
[07:56:43] =========== [PASSED] xe_rtp_process_to_sr_tests ============
[07:56:43] ================== xe_rtp_process_tests  ===================
[07:56:43] [PASSED] active1
[07:56:43] [PASSED] active2
[07:56:43] [PASSED] active-inactive
[07:56:43] [PASSED] inactive-active
[07:56:43] [PASSED] inactive-1st_or_active-inactive
[07:56:43] [PASSED] inactive-2nd_or_active-inactive
[07:56:43] [PASSED] inactive-last_or_active-inactive
[07:56:43] [PASSED] inactive-no_or_active-inactive
[07:56:43] ============== [PASSED] xe_rtp_process_tests ===============
[07:56:43] ===================== [PASSED] xe_rtp ======================
[07:56:43] ==================== xe_wa (1 subtest) =====================
[07:56:43] ======================== xe_wa_gt  =========================
[07:56:43] [PASSED] TIGERLAKE B0
[07:56:43] [PASSED] DG1 A0
[07:56:43] [PASSED] DG1 B0
[07:56:43] [PASSED] ALDERLAKE_S A0
[07:56:43] [PASSED] ALDERLAKE_S B0
[07:56:43] [PASSED] ALDERLAKE_S C0
[07:56:43] [PASSED] ALDERLAKE_S D0
[07:56:43] [PASSED] ALDERLAKE_P A0
[07:56:43] [PASSED] ALDERLAKE_P B0
[07:56:43] [PASSED] ALDERLAKE_P C0
[07:56:43] [PASSED] ALDERLAKE_S RPLS D0
[07:56:43] [PASSED] ALDERLAKE_P RPLU E0
[07:56:43] [PASSED] DG2 G10 C0
[07:56:43] [PASSED] DG2 G11 B1
[07:56:43] [PASSED] DG2 G12 A1
[07:56:43] [PASSED] METEORLAKE 12.70(Xe_LPG) A0 13.00(Xe_LPM+) A0
[07:56:43] [PASSED] METEORLAKE 12.71(Xe_LPG) A0 13.00(Xe_LPM+) A0
[07:56:43] [PASSED] METEORLAKE 12.74(Xe_LPG+) A0 13.00(Xe_LPM+) A0
[07:56:43] [PASSED] LUNARLAKE 20.04(Xe2_LPG) A0 20.00(Xe2_LPM) A0
[07:56:43] [PASSED] LUNARLAKE 20.04(Xe2_LPG) B0 20.00(Xe2_LPM) A0
[07:56:43] [PASSED] BATTLEMAGE 20.01(Xe2_HPG) A0 13.01(Xe2_HPM) A1
[07:56:43] [PASSED] PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0
[07:56:43] ==================== [PASSED] xe_wa_gt =====================
[07:56:43] ====================== [PASSED] xe_wa ======================
[07:56:43] ============================================================
[07:56:43] Testing complete. Ran 597 tests: passed: 579, skipped: 18
[07:56:43] Elapsed time: 36.067s total, 4.298s configuring, 31.152s building, 0.608s running

+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/tests/.kunitconfig
ERROR:root:In file included from ../drivers/gpu/buddy.c:12:
../include/linux/gpu_buddy.h:220:15: error: return type defaults to ‘int’ [-Werror=implicit-int]
  220 | static inline gpu_buddy_driver_lock_held(struct gpu_buddy *mm)
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~~
../include/linux/gpu_buddy.h: In function ‘gpu_buddy_driver_lock_held’:
../include/linux/gpu_buddy.h:222:1: error: control reaches end of non-void function [-Werror=return-type]
  222 | }
      | ^
cc1: some warnings being treated as errors
make[5]: *** [../scripts/Makefile.build:289: drivers/gpu/buddy.o] Error 1
make[5]: *** Waiting for unfinished jobs....
In file included from ../drivers/gpu/drm/drm_buddy.c:13:
../include/linux/gpu_buddy.h:220:15: error: return type defaults to ‘int’ [-Werror=implicit-int]
  220 | static inline gpu_buddy_driver_lock_held(struct gpu_buddy *mm)
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~~
../include/linux/gpu_buddy.h: In function ‘gpu_buddy_driver_lock_held’:
../include/linux/gpu_buddy.h:222:1: error: control reaches end of non-void function [-Werror=return-type]
  222 | }
      | ^
cc1: some warnings being treated as errors
make[6]: *** [../scripts/Makefile.build:289: drivers/gpu/drm/drm_buddy.o] Error 1
make[6]: *** Waiting for unfinished jobs....
make[5]: *** [../scripts/Makefile.build:549: drivers/gpu/drm] Error 2
make[4]: *** [../scripts/Makefile.build:549: drivers/gpu] Error 2
make[3]: *** [../scripts/Makefile.build:549: drivers] Error 2
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [/kernel/Makefile:2105: .] Error 2
make[1]: *** [/kernel/Makefile:248: __sub-make] Error 2
make: *** [Makefile:248: __sub-make] Error 2

[07:56:43] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[07:56:44] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu buddy manager
  2026-04-16  7:49 ` [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu buddy manager Tejas Upadhyay
@ 2026-04-16  8:55   ` Matthew Auld
  2026-04-16  9:43     ` Upadhyay, Tejas
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Auld @ 2026-04-16  8:55 UTC (permalink / raw)
  To: Tejas Upadhyay, intel-xe
  Cc: matthew.brost, thomas.hellstrom, himal.prasad.ghimiray

On 16/04/2026 08:49, Tejas Upadhyay wrote:
> Integrating lockdep into the gpu_buddy manager as standard practice for
> verifying that internal resources are correctly protected by their
> associated locks.
> 
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> ---
>   drivers/gpu/buddy.c                  | 18 ++++++++++--
>   drivers/gpu/drm/drm_buddy.c          |  7 +++--
>   drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  3 ++
>   include/linux/gpu_buddy.h            | 41 ++++++++++++++++++++++++++++
>   4 files changed, 65 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c
> index 52686672e99f..53ff85ac2105 100644
> --- a/drivers/gpu/buddy.c
> +++ b/drivers/gpu/buddy.c
> @@ -437,6 +437,9 @@ int gpu_buddy_init(struct gpu_buddy *mm, u64 size, u64 chunk_size)
>   		root_count++;
>   	} while (size);
>   
> +#ifdef CONFIG_LOCKDEP
> +	mm->lock_dep_map = NULL;
> +#endif
>   	return 0;
>   
>   out_free_roots:
> @@ -464,6 +467,7 @@ void gpu_buddy_fini(struct gpu_buddy *mm)
>   	unsigned int order;
>   	int i;
>   
> +	gpu_buddy_driver_lock_held(mm);
>   	size = mm->size;
>   
>   	for (i = 0; i < mm->n_roots; ++i) {
> @@ -538,6 +542,7 @@ void gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear)
>   	unsigned int order;
>   	int i;
>   
> +	gpu_buddy_driver_lock_held(mm);
>   	size = mm->size;
>   	for (i = 0; i < mm->n_roots; ++i) {
>   		order = ilog2(size) - ilog2(mm->chunk_size);
> @@ -580,6 +585,7 @@ EXPORT_SYMBOL(gpu_buddy_reset_clear);
>   void gpu_buddy_free_block(struct gpu_buddy *mm,
>   			  struct gpu_buddy_block *block)
>   {
> +	gpu_buddy_driver_lock_held(mm);
>   	BUG_ON(!gpu_buddy_block_is_allocated(block));
>   	mm->avail += gpu_buddy_block_size(mm, block);
>   	if (gpu_buddy_block_is_clear(block))
> @@ -633,6 +639,7 @@ void gpu_buddy_free_list(struct gpu_buddy *mm,
>   {
>   	bool mark_clear = flags & GPU_BUDDY_CLEARED;
>   
> +	gpu_buddy_driver_lock_held(mm);
>   	__gpu_buddy_free_list(mm, objects, mark_clear, !mark_clear);
>   }
>   EXPORT_SYMBOL(gpu_buddy_free_list);
> @@ -1172,6 +1179,8 @@ int gpu_buddy_block_trim(struct gpu_buddy *mm,
>   	u64 new_start;
>   	int err;
>   
> +	gpu_buddy_driver_lock_held(mm);
> +
>   	if (!list_is_singular(blocks))
>   		return -EINVAL;
>   
> @@ -1287,6 +1296,8 @@ int gpu_buddy_alloc_blocks(struct gpu_buddy *mm,
>   	unsigned long pages;
>   	int err;
>   
> +	gpu_buddy_driver_lock_held(mm);
> +
>   	if (size < mm->chunk_size)
>   		return -EINVAL;
>   
> @@ -1458,9 +1469,11 @@ EXPORT_SYMBOL(gpu_buddy_alloc_blocks);
>   void gpu_buddy_block_print(struct gpu_buddy *mm,
>   			   struct gpu_buddy_block *block)
>   {
> -	u64 start = gpu_buddy_block_offset(block);
> -	u64 size = gpu_buddy_block_size(mm, block);
> +	u64 start, size;
>   
> +	gpu_buddy_driver_lock_held(mm);

I don't think we want this one. The mm interaction is just for immutable 
state, and the block itself is essentially owned by the caller. Same 
reason why we don't want annotations for stuff like 
gpu_buddy_block_offset() etc.

> +	start = gpu_buddy_block_offset(block);
> +	size = gpu_buddy_block_size(mm, block);
>   	pr_info("%#018llx-%#018llx: %llu\n", start, start + size, size);
>   }
>   EXPORT_SYMBOL(gpu_buddy_block_print);
> @@ -1475,6 +1488,7 @@ void gpu_buddy_print(struct gpu_buddy *mm)
>   {
>   	int order;
>   
> +	gpu_buddy_driver_lock_held(mm);
>   	pr_info("chunk_size: %lluKiB, total: %lluMiB, free: %lluMiB, clear_free: %lluMiB\n",
>   		mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20, mm->clear_avail >> 20);
>   
> diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
> index 841f3de5f307..f4ad09b8a36e 100644
> --- a/drivers/gpu/drm/drm_buddy.c
> +++ b/drivers/gpu/drm/drm_buddy.c
> @@ -25,9 +25,11 @@ void drm_buddy_block_print(struct gpu_buddy *mm,
>   			   struct gpu_buddy_block *block,
>   			   struct drm_printer *p)
>   {
> -	u64 start = gpu_buddy_block_offset(block);
> -	u64 size = gpu_buddy_block_size(mm, block);
> +	u64 start, size;
>   
> +	gpu_buddy_driver_lock_held(mm);
> +	start = gpu_buddy_block_offset(block);
> +	size = gpu_buddy_block_size(mm, block);

Same here.

>   	drm_printf(p, "%#018llx-%#018llx: %llu\n", start, start + size, size);
>   }
>   EXPORT_SYMBOL(drm_buddy_block_print);
> @@ -42,6 +44,7 @@ void drm_buddy_print(struct gpu_buddy *mm, struct drm_printer *p)
>   {
>   	int order;
>   
> +	gpu_buddy_driver_lock_held(mm);
>   	drm_printf(p, "chunk_size: %lluKiB, total: %lluMiB, free: %lluMiB, clear_free: %lluMiB\n",
>   		   mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20, mm->clear_avail >> 20);
>   
> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> index 01a9b92772f8..935e589dd4b0 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> @@ -293,7 +293,9 @@ static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
>   
>   	WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
>   
> +	mutex_lock(&mgr->lock);
>   	gpu_buddy_fini(&mgr->mm);
> +	mutex_unlock(&mgr->lock);

This shouldn't need a lock. Annotation for this one should also be dropped.

>   
>   	ttm_resource_manager_cleanup(&mgr->manager);
>   
> @@ -328,6 +330,7 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
>   	if (err)
>   		return err;
>   
> +	gpu_buddy_driver_set_lock(&mgr->mm, &mgr->lock);
>   	ttm_set_driver_manager(&xe->ttm, mem_type, &mgr->manager);
>   	ttm_resource_manager_set_used(&mgr->manager, true);
>   
> diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
> index 5fa917ba5450..c174de80ad72 100644
> --- a/include/linux/gpu_buddy.h
> +++ b/include/linux/gpu_buddy.h
> @@ -154,6 +154,7 @@ struct gpu_buddy_block {
>    * @avail: Total free space currently available for allocation in bytes.
>    * @clear_avail: Free space available in the clear tree (zeroed memory) in bytes.
>    *               This is a subset of @avail.
> + * @lock_dep_map: Annotates gpu_buddy API with a driver provided lock.
>    */
>   struct gpu_buddy {
>   /* private: */
> @@ -179,8 +180,48 @@ struct gpu_buddy {
>   	u64 size;
>   	u64 avail;
>   	u64 clear_avail;
> +#ifdef CONFIG_LOCKDEP
> +	struct lockdep_map *lock_dep_map;
> +#endif
>   };
>   
> +#ifdef CONFIG_LOCKDEP
> +/**
> + * gpu_buddy_driver_set_lock() - Set the lock protecting accesses to GPU BUDDY
> + * @mm: Pointer to GPU buddy structure.
> + * @lock: the lock used to protect the gpu buddy. The locking primitive
> + * must contain a dep_map field.
> + *
> + * Call this to annotate gpu_buddy APIs which access/modify gpu_buddy manager
> + */
> +#define gpu_buddy_driver_set_lock(mm, lock) \
> +	do { \
> +		struct gpu_buddy *__mm = (mm); \
> +		if (!WARN(__mm->lock_dep_map, "GPU BUDDY MM lock should be set only once.")) \
> +			__mm->lock_dep_map = &(lock)->dep_map; \
> +	} while (0)
> +#else
> +#define gpu_buddy_driver_set_lock(mm, lock) do { (void)(mm); (void)(lock); } while (0)
> +#endif
> +
> +#ifdef CONFIG_LOCKDEP
> +/**
> + * gpu_buddy_driver_lock_held() - Assert GPU BUDDY manager lock is held
> + * @mm: Pointer to the GPU BUDDY structure.
> + *
> + * Ensure driver lock is held.
> + */
> +static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm)
> +{
> +	if ((mm)->lock_dep_map)
> +		lockdep_assert(lock_is_held_type((mm)->lock_dep_map, 0));
> +}
> +#else
> +static inline gpu_buddy_driver_lock_held(struct gpu_buddy *mm)
> +{
> +}
> +#endif
> +
>   static inline u64
>   gpu_buddy_block_offset(const struct gpu_buddy_block *block)
>   {


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu buddy manager
  2026-04-16  8:55   ` Matthew Auld
@ 2026-04-16  9:43     ` Upadhyay, Tejas
  2026-04-16  9:56       ` Matthew Auld
  0 siblings, 1 reply; 19+ messages in thread
From: Upadhyay, Tejas @ 2026-04-16  9:43 UTC (permalink / raw)
  To: Auld, Matthew, intel-xe@lists.freedesktop.org
  Cc: Brost, Matthew, thomas.hellstrom@linux.intel.com,
	Ghimiray, Himal Prasad



> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: 16 April 2026 14:25
> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> xe@lists.freedesktop.org
> Cc: Brost, Matthew <matthew.brost@intel.com>;
> thomas.hellstrom@linux.intel.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu
> buddy manager
> 
> On 16/04/2026 08:49, Tejas Upadhyay wrote:
> > Integrating lockdep into the gpu_buddy manager as standard practice
> > for verifying that internal resources are correctly protected by their
> > associated locks.
> >
> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > ---
> >   drivers/gpu/buddy.c                  | 18 ++++++++++--
> >   drivers/gpu/drm/drm_buddy.c          |  7 +++--
> >   drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  3 ++
> >   include/linux/gpu_buddy.h            | 41 ++++++++++++++++++++++++++++
> >   4 files changed, 65 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c index
> > 52686672e99f..53ff85ac2105 100644
> > --- a/drivers/gpu/buddy.c
> > +++ b/drivers/gpu/buddy.c
> > @@ -437,6 +437,9 @@ int gpu_buddy_init(struct gpu_buddy *mm, u64
> size, u64 chunk_size)
> >   		root_count++;
> >   	} while (size);
> >
> > +#ifdef CONFIG_LOCKDEP
> > +	mm->lock_dep_map = NULL;
> > +#endif
> >   	return 0;
> >
> >   out_free_roots:
> > @@ -464,6 +467,7 @@ void gpu_buddy_fini(struct gpu_buddy *mm)
> >   	unsigned int order;
> >   	int i;
> >
> > +	gpu_buddy_driver_lock_held(mm);
> >   	size = mm->size;
> >
> >   	for (i = 0; i < mm->n_roots; ++i) { @@ -538,6 +542,7 @@ void
> > gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear)
> >   	unsigned int order;
> >   	int i;
> >
> > +	gpu_buddy_driver_lock_held(mm);
> >   	size = mm->size;
> >   	for (i = 0; i < mm->n_roots; ++i) {
> >   		order = ilog2(size) - ilog2(mm->chunk_size); @@ -580,6
> +585,7 @@
> > EXPORT_SYMBOL(gpu_buddy_reset_clear);
> >   void gpu_buddy_free_block(struct gpu_buddy *mm,
> >   			  struct gpu_buddy_block *block)
> >   {
> > +	gpu_buddy_driver_lock_held(mm);
> >   	BUG_ON(!gpu_buddy_block_is_allocated(block));
> >   	mm->avail += gpu_buddy_block_size(mm, block);
> >   	if (gpu_buddy_block_is_clear(block)) @@ -633,6 +639,7 @@ void
> > gpu_buddy_free_list(struct gpu_buddy *mm,
> >   {
> >   	bool mark_clear = flags & GPU_BUDDY_CLEARED;
> >
> > +	gpu_buddy_driver_lock_held(mm);
> >   	__gpu_buddy_free_list(mm, objects, mark_clear, !mark_clear);
> >   }
> >   EXPORT_SYMBOL(gpu_buddy_free_list);
> > @@ -1172,6 +1179,8 @@ int gpu_buddy_block_trim(struct gpu_buddy
> *mm,
> >   	u64 new_start;
> >   	int err;
> >
> > +	gpu_buddy_driver_lock_held(mm);
> > +
> >   	if (!list_is_singular(blocks))
> >   		return -EINVAL;
> >
> > @@ -1287,6 +1296,8 @@ int gpu_buddy_alloc_blocks(struct gpu_buddy
> *mm,
> >   	unsigned long pages;
> >   	int err;
> >
> > +	gpu_buddy_driver_lock_held(mm);
> > +
> >   	if (size < mm->chunk_size)
> >   		return -EINVAL;
> >
> > @@ -1458,9 +1469,11 @@ EXPORT_SYMBOL(gpu_buddy_alloc_blocks);
> >   void gpu_buddy_block_print(struct gpu_buddy *mm,
> >   			   struct gpu_buddy_block *block)
> >   {
> > -	u64 start = gpu_buddy_block_offset(block);
> > -	u64 size = gpu_buddy_block_size(mm, block);
> > +	u64 start, size;
> >
> > +	gpu_buddy_driver_lock_held(mm);
> 
> I don't think we want this one. The mm interaction is just for immutable state,
> and the block itself is essentially owned by the caller. Same reason why we
> don't want annotations for stuff like
> gpu_buddy_block_offset() etc.

Yes, makes sense. I will change it.

> 
> > +	start = gpu_buddy_block_offset(block);
> > +	size = gpu_buddy_block_size(mm, block);
> >   	pr_info("%#018llx-%#018llx: %llu\n", start, start + size, size);
> >   }
> >   EXPORT_SYMBOL(gpu_buddy_block_print);
> > @@ -1475,6 +1488,7 @@ void gpu_buddy_print(struct gpu_buddy *mm)
> >   {
> >   	int order;
> >
> > +	gpu_buddy_driver_lock_held(mm);
> >   	pr_info("chunk_size: %lluKiB, total: %lluMiB, free: %lluMiB, clear_free:
> %lluMiB\n",
> >   		mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20,
> > mm->clear_avail >> 20);
> >
> > diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
> > index 841f3de5f307..f4ad09b8a36e 100644
> > --- a/drivers/gpu/drm/drm_buddy.c
> > +++ b/drivers/gpu/drm/drm_buddy.c
> > @@ -25,9 +25,11 @@ void drm_buddy_block_print(struct gpu_buddy
> *mm,
> >   			   struct gpu_buddy_block *block,
> >   			   struct drm_printer *p)
> >   {
> > -	u64 start = gpu_buddy_block_offset(block);
> > -	u64 size = gpu_buddy_block_size(mm, block);
> > +	u64 start, size;
> >
> > +	gpu_buddy_driver_lock_held(mm);
> > +	start = gpu_buddy_block_offset(block);
> > +	size = gpu_buddy_block_size(mm, block);
> 
> Same here.

Yes

> 
> >   	drm_printf(p, "%#018llx-%#018llx: %llu\n", start, start + size, size);
> >   }
> >   EXPORT_SYMBOL(drm_buddy_block_print);
> > @@ -42,6 +44,7 @@ void drm_buddy_print(struct gpu_buddy *mm, struct
> drm_printer *p)
> >   {
> >   	int order;
> >
> > +	gpu_buddy_driver_lock_held(mm);
> >   	drm_printf(p, "chunk_size: %lluKiB, total: %lluMiB, free: %lluMiB,
> clear_free: %lluMiB\n",
> >   		   mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20,
> > mm->clear_avail >> 20);
> >
> > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > index 01a9b92772f8..935e589dd4b0 100644
> > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > @@ -293,7 +293,9 @@ static void xe_ttm_vram_mgr_fini(struct
> drm_device
> > *dev, void *arg)
> >
> >   	WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
> >
> > +	mutex_lock(&mgr->lock);
> >   	gpu_buddy_fini(&mgr->mm);
> > +	mutex_unlock(&mgr->lock);
> 
> This shouldn't need a lock. Annotation for this one should also be dropped.

The thought was that while we call buddy_fini(), there could be in-flight access to the buddy manager (some new allocation in progress), and taking the lock would protect against that. Please let me know what you think!

Tejas
> 
> >
> >   	ttm_resource_manager_cleanup(&mgr->manager);
> >
> > @@ -328,6 +330,7 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe,
> struct xe_ttm_vram_mgr *mgr,
> >   	if (err)
> >   		return err;
> >
> > +	gpu_buddy_driver_set_lock(&mgr->mm, &mgr->lock);
> >   	ttm_set_driver_manager(&xe->ttm, mem_type, &mgr->manager);
> >   	ttm_resource_manager_set_used(&mgr->manager, true);
> >
> > diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
> > index 5fa917ba5450..c174de80ad72 100644
> > --- a/include/linux/gpu_buddy.h
> > +++ b/include/linux/gpu_buddy.h
> > @@ -154,6 +154,7 @@ struct gpu_buddy_block {
> >    * @avail: Total free space currently available for allocation in bytes.
> >    * @clear_avail: Free space available in the clear tree (zeroed memory) in
> bytes.
> >    *               This is a subset of @avail.
> > + * @lock_dep_map: Annotates gpu_buddy API with a driver provided lock.
> >    */
> >   struct gpu_buddy {
> >   /* private: */
> > @@ -179,8 +180,48 @@ struct gpu_buddy {
> >   	u64 size;
> >   	u64 avail;
> >   	u64 clear_avail;
> > +#ifdef CONFIG_LOCKDEP
> > +	struct lockdep_map *lock_dep_map;
> > +#endif
> >   };
> >
> > +#ifdef CONFIG_LOCKDEP
> > +/**
> > + * gpu_buddy_driver_set_lock() - Set the lock protecting accesses to
> > +GPU BUDDY
> > + * @mm: Pointer to GPU buddy structure.
> > + * @lock: the lock used to protect the gpu buddy. The locking
> > +primitive
> > + * must contain a dep_map field.
> > + *
> > + * Call this to annotate gpu_buddy APIs which access/modify gpu_buddy
> > +manager  */ #define gpu_buddy_driver_set_lock(mm, lock) \
> > +	do { \
> > +		struct gpu_buddy *__mm = (mm); \
> > +		if (!WARN(__mm->lock_dep_map, "GPU BUDDY MM lock
> should be set only once.")) \
> > +			__mm->lock_dep_map = &(lock)->dep_map; \
> > +	} while (0)
> > +#else
> > +#define gpu_buddy_driver_set_lock(mm, lock) do { (void)(mm);
> > +(void)(lock); } while (0) #endif
> > +
> > +#ifdef CONFIG_LOCKDEP
> > +/**
> > + * gpu_buddy_driver_lock_held() - Assert GPU BUDDY manager lock is
> > +held
> > + * @mm: Pointer to the GPU BUDDY structure.
> > + *
> > + * Ensure driver lock is held.
> > + */
> > +static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm) {
> > +	if ((mm)->lock_dep_map)
> > +		lockdep_assert(lock_is_held_type((mm)->lock_dep_map, 0));
> } #else
> > +static inline gpu_buddy_driver_lock_held(struct gpu_buddy *mm) { }
> > +#endif
> > +
> >   static inline u64
> >   gpu_buddy_block_offset(const struct gpu_buddy_block *block)
> >   {


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu buddy manager
  2026-04-16  9:43     ` Upadhyay, Tejas
@ 2026-04-16  9:56       ` Matthew Auld
  2026-04-16 10:04         ` Upadhyay, Tejas
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Auld @ 2026-04-16  9:56 UTC (permalink / raw)
  To: Upadhyay, Tejas, intel-xe@lists.freedesktop.org
  Cc: Brost, Matthew, thomas.hellstrom@linux.intel.com,
	Ghimiray, Himal Prasad

On 16/04/2026 10:43, Upadhyay, Tejas wrote:
> 
> 
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: 16 April 2026 14:25
>> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
>> xe@lists.freedesktop.org
>> Cc: Brost, Matthew <matthew.brost@intel.com>;
>> thomas.hellstrom@linux.intel.com; Ghimiray, Himal Prasad
>> <himal.prasad.ghimiray@intel.com>
>> Subject: Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu
>> buddy manager
>>
>> On 16/04/2026 08:49, Tejas Upadhyay wrote:
>>> Integrating lockdep into the gpu_buddy manager as standard practice
>>> for verifying that internal resources are correctly protected by their
>>> associated locks.
>>>
>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
>>> ---
>>>    drivers/gpu/buddy.c                  | 18 ++++++++++--
>>>    drivers/gpu/drm/drm_buddy.c          |  7 +++--
>>>    drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  3 ++
>>>    include/linux/gpu_buddy.h            | 41 ++++++++++++++++++++++++++++
>>>    4 files changed, 65 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c index
>>> 52686672e99f..53ff85ac2105 100644
>>> --- a/drivers/gpu/buddy.c
>>> +++ b/drivers/gpu/buddy.c
>>> @@ -437,6 +437,9 @@ int gpu_buddy_init(struct gpu_buddy *mm, u64
>> size, u64 chunk_size)
>>>    		root_count++;
>>>    	} while (size);
>>>
>>> +#ifdef CONFIG_LOCKDEP
>>> +	mm->lock_dep_map = NULL;
>>> +#endif
>>>    	return 0;
>>>
>>>    out_free_roots:
>>> @@ -464,6 +467,7 @@ void gpu_buddy_fini(struct gpu_buddy *mm)
>>>    	unsigned int order;
>>>    	int i;
>>>
>>> +	gpu_buddy_driver_lock_held(mm);
>>>    	size = mm->size;
>>>
>>>    	for (i = 0; i < mm->n_roots; ++i) { @@ -538,6 +542,7 @@ void
>>> gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear)
>>>    	unsigned int order;
>>>    	int i;
>>>
>>> +	gpu_buddy_driver_lock_held(mm);
>>>    	size = mm->size;
>>>    	for (i = 0; i < mm->n_roots; ++i) {
>>>    		order = ilog2(size) - ilog2(mm->chunk_size); @@ -580,6
>> +585,7 @@
>>> EXPORT_SYMBOL(gpu_buddy_reset_clear);
>>>    void gpu_buddy_free_block(struct gpu_buddy *mm,
>>>    			  struct gpu_buddy_block *block)
>>>    {
>>> +	gpu_buddy_driver_lock_held(mm);
>>>    	BUG_ON(!gpu_buddy_block_is_allocated(block));
>>>    	mm->avail += gpu_buddy_block_size(mm, block);
>>>    	if (gpu_buddy_block_is_clear(block)) @@ -633,6 +639,7 @@ void
>>> gpu_buddy_free_list(struct gpu_buddy *mm,
>>>    {
>>>    	bool mark_clear = flags & GPU_BUDDY_CLEARED;
>>>
>>> +	gpu_buddy_driver_lock_held(mm);
>>>    	__gpu_buddy_free_list(mm, objects, mark_clear, !mark_clear);
>>>    }
>>>    EXPORT_SYMBOL(gpu_buddy_free_list);
>>> @@ -1172,6 +1179,8 @@ int gpu_buddy_block_trim(struct gpu_buddy
>> *mm,
>>>    	u64 new_start;
>>>    	int err;
>>>
>>> +	gpu_buddy_driver_lock_held(mm);
>>> +
>>>    	if (!list_is_singular(blocks))
>>>    		return -EINVAL;
>>>
>>> @@ -1287,6 +1296,8 @@ int gpu_buddy_alloc_blocks(struct gpu_buddy
>> *mm,
>>>    	unsigned long pages;
>>>    	int err;
>>>
>>> +	gpu_buddy_driver_lock_held(mm);
>>> +
>>>    	if (size < mm->chunk_size)
>>>    		return -EINVAL;
>>>
>>> @@ -1458,9 +1469,11 @@ EXPORT_SYMBOL(gpu_buddy_alloc_blocks);
>>>    void gpu_buddy_block_print(struct gpu_buddy *mm,
>>>    			   struct gpu_buddy_block *block)
>>>    {
>>> -	u64 start = gpu_buddy_block_offset(block);
>>> -	u64 size = gpu_buddy_block_size(mm, block);
>>> +	u64 start, size;
>>>
>>> +	gpu_buddy_driver_lock_held(mm);
>>
>> I don't think we want this one. The mm interaction is just for immutable state,
>> and the block itself is essentially owned by the caller. Same reason why we
>> don't want annotations for stuff like
>> gpu_buddy_block_offset() etc.
> 
> Yes, makes sense. I will change it.
> 
>>
>>> +	start = gpu_buddy_block_offset(block);
>>> +	size = gpu_buddy_block_size(mm, block);
>>>    	pr_info("%#018llx-%#018llx: %llu\n", start, start + size, size);
>>>    }
>>>    EXPORT_SYMBOL(gpu_buddy_block_print);
>>> @@ -1475,6 +1488,7 @@ void gpu_buddy_print(struct gpu_buddy *mm)
>>>    {
>>>    	int order;
>>>
>>> +	gpu_buddy_driver_lock_held(mm);
>>>    	pr_info("chunk_size: %lluKiB, total: %lluMiB, free: %lluMiB, clear_free:
>> %lluMiB\n",
>>>    		mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20,
>>> mm->clear_avail >> 20);
>>>
>>> diff --git a/drivers/gpu/drm/drm_buddy.c b/drivers/gpu/drm/drm_buddy.c
>>> index 841f3de5f307..f4ad09b8a36e 100644
>>> --- a/drivers/gpu/drm/drm_buddy.c
>>> +++ b/drivers/gpu/drm/drm_buddy.c
>>> @@ -25,9 +25,11 @@ void drm_buddy_block_print(struct gpu_buddy
>> *mm,
>>>    			   struct gpu_buddy_block *block,
>>>    			   struct drm_printer *p)
>>>    {
>>> -	u64 start = gpu_buddy_block_offset(block);
>>> -	u64 size = gpu_buddy_block_size(mm, block);
>>> +	u64 start, size;
>>>
>>> +	gpu_buddy_driver_lock_held(mm);
>>> +	start = gpu_buddy_block_offset(block);
>>> +	size = gpu_buddy_block_size(mm, block);
>>
>> Same here.
> 
> Yes
> 
>>
>>>    	drm_printf(p, "%#018llx-%#018llx: %llu\n", start, start + size, size);
>>>    }
>>>    EXPORT_SYMBOL(drm_buddy_block_print);
>>> @@ -42,6 +44,7 @@ void drm_buddy_print(struct gpu_buddy *mm, struct
>> drm_printer *p)
>>>    {
>>>    	int order;
>>>
>>> +	gpu_buddy_driver_lock_held(mm);
>>>    	drm_printf(p, "chunk_size: %lluKiB, total: %lluMiB, free: %lluMiB,
>> clear_free: %lluMiB\n",
>>>    		   mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20,
>>> mm->clear_avail >> 20);
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>>> b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>>> index 01a9b92772f8..935e589dd4b0 100644
>>> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>>> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>>> @@ -293,7 +293,9 @@ static void xe_ttm_vram_mgr_fini(struct
>> drm_device
>>> *dev, void *arg)
>>>
>>>    	WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
>>>
>>> +	mutex_lock(&mgr->lock);
>>>    	gpu_buddy_fini(&mgr->mm);
>>> +	mutex_unlock(&mgr->lock);
>>
>> This shouldn't need a lock. Annotation for this one should also be dropped.
> 
> The thought was that while we call buddy_fini(), there could be in-flight access to the buddy manager (some new allocation in progress), and taking the lock would protect against that. Please let me know what you think!

The mm stuff is about to get nuked, so if something is still messing 
with this then we are in big trouble anyway, and grabbing the lock 
doesn't really help much. The fini() here should be triggered as we start 
tearing down the driver.

> 
> Tejas
>>
>>>
>>>    	ttm_resource_manager_cleanup(&mgr->manager);
>>>
>>> @@ -328,6 +330,7 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe,
>> struct xe_ttm_vram_mgr *mgr,
>>>    	if (err)
>>>    		return err;
>>>
>>> +	gpu_buddy_driver_set_lock(&mgr->mm, &mgr->lock);
>>>    	ttm_set_driver_manager(&xe->ttm, mem_type, &mgr->manager);
>>>    	ttm_resource_manager_set_used(&mgr->manager, true);
>>>
>>> diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
>>> index 5fa917ba5450..c174de80ad72 100644
>>> --- a/include/linux/gpu_buddy.h
>>> +++ b/include/linux/gpu_buddy.h
>>> @@ -154,6 +154,7 @@ struct gpu_buddy_block {
>>>     * @avail: Total free space currently available for allocation in bytes.
>>>     * @clear_avail: Free space available in the clear tree (zeroed memory) in
>> bytes.
>>>     *               This is a subset of @avail.
>>> + * @lock_dep_map: Annotates gpu_buddy API with a driver provided lock.
>>>     */
>>>    struct gpu_buddy {
>>>    /* private: */
>>> @@ -179,8 +180,48 @@ struct gpu_buddy {
>>>    	u64 size;
>>>    	u64 avail;
>>>    	u64 clear_avail;
>>> +#ifdef CONFIG_LOCKDEP
>>> +	struct lockdep_map *lock_dep_map;
>>> +#endif
>>>    };
>>>
>>> +#ifdef CONFIG_LOCKDEP
>>> +/**
>>> + * gpu_buddy_driver_set_lock() - Set the lock protecting accesses to
>>> +GPU BUDDY
>>> + * @mm: Pointer to GPU buddy structure.
>>> + * @lock: the lock used to protect the gpu buddy. The locking
>>> +primitive
>>> + * must contain a dep_map field.
>>> + *
>>> + * Call this to annotate gpu_buddy APIs which access/modify gpu_buddy
>>> +manager  */ #define gpu_buddy_driver_set_lock(mm, lock) \
>>> +	do { \
>>> +		struct gpu_buddy *__mm = (mm); \
>>> +		if (!WARN(__mm->lock_dep_map, "GPU BUDDY MM lock
>> should be set only once.")) \
>>> +			__mm->lock_dep_map = &(lock)->dep_map; \
>>> +	} while (0)
>>> +#else
>>> +#define gpu_buddy_driver_set_lock(mm, lock) do { (void)(mm);
>>> +(void)(lock); } while (0) #endif
>>> +
>>> +#ifdef CONFIG_LOCKDEP
>>> +/**
>>> + * gpu_buddy_driver_lock_held() - Assert GPU BUDDY manager lock is
>>> +held
>>> + * @mm: Pointer to the GPU BUDDY structure.
>>> + *
>>> + * Ensure driver lock is held.
>>> + */
>>> +static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm) {
>>> +	if ((mm)->lock_dep_map)
>>> +		lockdep_assert(lock_is_held_type((mm)->lock_dep_map, 0));
>> } #else
>>> +static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm) { }
>>> +#endif
>>> +
>>>    static inline u64
>>>    gpu_buddy_block_offset(const struct gpu_buddy_block *block)
>>>    {
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu buddy manager
  2026-04-16  9:56       ` Matthew Auld
@ 2026-04-16 10:04         ` Upadhyay, Tejas
  2026-04-16 10:15           ` Matthew Auld
  0 siblings, 1 reply; 19+ messages in thread
From: Upadhyay, Tejas @ 2026-04-16 10:04 UTC (permalink / raw)
  To: Auld, Matthew, intel-xe@lists.freedesktop.org
  Cc: Brost, Matthew, thomas.hellstrom@linux.intel.com,
	Ghimiray, Himal Prasad



> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: 16 April 2026 15:27
> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> xe@lists.freedesktop.org
> Cc: Brost, Matthew <matthew.brost@intel.com>;
> thomas.hellstrom@linux.intel.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu
> buddy manager
> 
> On 16/04/2026 10:43, Upadhyay, Tejas wrote:
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld@intel.com>
> >> Sent: 16 April 2026 14:25
> >> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> >> xe@lists.freedesktop.org
> >> Cc: Brost, Matthew <matthew.brost@intel.com>;
> >> thomas.hellstrom@linux.intel.com; Ghimiray, Himal Prasad
> >> <himal.prasad.ghimiray@intel.com>
> >> Subject: Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for
> >> gpu buddy manager
> >>
> >> On 16/04/2026 08:49, Tejas Upadhyay wrote:
> >>> Integrating lockdep into the gpu_buddy manager as standard practice
> >>> for verifying that internal resources are correctly protected by
> >>> their associated locks.
> >>>
> >>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> >>> ---
> >>>    drivers/gpu/buddy.c                  | 18 ++++++++++--
> >>>    drivers/gpu/drm/drm_buddy.c          |  7 +++--
> >>>    drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  3 ++
> >>>    include/linux/gpu_buddy.h            | 41
> ++++++++++++++++++++++++++++
> >>>    4 files changed, 65 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c index
> >>> 52686672e99f..53ff85ac2105 100644
> >>> --- a/drivers/gpu/buddy.c
> >>> +++ b/drivers/gpu/buddy.c
> >>> @@ -437,6 +437,9 @@ int gpu_buddy_init(struct gpu_buddy *mm, u64
> >> size, u64 chunk_size)
> >>>    		root_count++;
> >>>    	} while (size);
> >>>
> >>> +#ifdef CONFIG_LOCKDEP
> >>> +	mm->lock_dep_map = NULL;
> >>> +#endif
> >>>    	return 0;
> >>>
> >>>    out_free_roots:
> >>> @@ -464,6 +467,7 @@ void gpu_buddy_fini(struct gpu_buddy *mm)
> >>>    	unsigned int order;
> >>>    	int i;
> >>>
> >>> +	gpu_buddy_driver_lock_held(mm);
> >>>    	size = mm->size;
> >>>
> >>>    	for (i = 0; i < mm->n_roots; ++i) { @@ -538,6 +542,7 @@ void
> >>> gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear)
> >>>    	unsigned int order;
> >>>    	int i;
> >>>
> >>> +	gpu_buddy_driver_lock_held(mm);
> >>>    	size = mm->size;
> >>>    	for (i = 0; i < mm->n_roots; ++i) {
> >>>    		order = ilog2(size) - ilog2(mm->chunk_size); @@ -580,6
> >> +585,7 @@
> >>> EXPORT_SYMBOL(gpu_buddy_reset_clear);
> >>>    void gpu_buddy_free_block(struct gpu_buddy *mm,
> >>>    			  struct gpu_buddy_block *block)
> >>>    {
> >>> +	gpu_buddy_driver_lock_held(mm);
> >>>    	BUG_ON(!gpu_buddy_block_is_allocated(block));
> >>>    	mm->avail += gpu_buddy_block_size(mm, block);
> >>>    	if (gpu_buddy_block_is_clear(block)) @@ -633,6 +639,7 @@ void
> >>> gpu_buddy_free_list(struct gpu_buddy *mm,
> >>>    {
> >>>    	bool mark_clear = flags & GPU_BUDDY_CLEARED;
> >>>
> >>> +	gpu_buddy_driver_lock_held(mm);
> >>>    	__gpu_buddy_free_list(mm, objects, mark_clear, !mark_clear);
> >>>    }
> >>>    EXPORT_SYMBOL(gpu_buddy_free_list);
> >>> @@ -1172,6 +1179,8 @@ int gpu_buddy_block_trim(struct gpu_buddy
> >> *mm,
> >>>    	u64 new_start;
> >>>    	int err;
> >>>
> >>> +	gpu_buddy_driver_lock_held(mm);
> >>> +
> >>>    	if (!list_is_singular(blocks))
> >>>    		return -EINVAL;
> >>>
> >>> @@ -1287,6 +1296,8 @@ int gpu_buddy_alloc_blocks(struct gpu_buddy
> >> *mm,
> >>>    	unsigned long pages;
> >>>    	int err;
> >>>
> >>> +	gpu_buddy_driver_lock_held(mm);
> >>> +
> >>>    	if (size < mm->chunk_size)
> >>>    		return -EINVAL;
> >>>
> >>> @@ -1458,9 +1469,11 @@ EXPORT_SYMBOL(gpu_buddy_alloc_blocks);
> >>>    void gpu_buddy_block_print(struct gpu_buddy *mm,
> >>>    			   struct gpu_buddy_block *block)
> >>>    {
> >>> -	u64 start = gpu_buddy_block_offset(block);
> >>> -	u64 size = gpu_buddy_block_size(mm, block);
> >>> +	u64 start, size;
> >>>
> >>> +	gpu_buddy_driver_lock_held(mm);
> >>
> >> I don't think we want this one. The mm interaction is just for
> >> immutable state, and the block itself is essentially owned by the
> >> caller. Same reason why we don't want annotations for stuff like
> >> gpu_buddy_block_offset() etc.
> >
> > Yes, makes sense. I will change it.
> >
> >>
> >>> +	start = gpu_buddy_block_offset(block);
> >>> +	size = gpu_buddy_block_size(mm, block);
> >>>    	pr_info("%#018llx-%#018llx: %llu\n", start, start + size, size);
> >>>    }
> >>>    EXPORT_SYMBOL(gpu_buddy_block_print);
> >>> @@ -1475,6 +1488,7 @@ void gpu_buddy_print(struct gpu_buddy
> *mm)
> >>>    {
> >>>    	int order;
> >>>
> >>> +	gpu_buddy_driver_lock_held(mm);
> >>>    	pr_info("chunk_size: %lluKiB, total: %lluMiB, free: %lluMiB, clear_free:
> >> %lluMiB\n",
> >>>    		mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20,
> >>> mm->clear_avail >> 20);
> >>>
> >>> diff --git a/drivers/gpu/drm/drm_buddy.c
> >>> b/drivers/gpu/drm/drm_buddy.c index 841f3de5f307..f4ad09b8a36e
> >>> 100644
> >>> --- a/drivers/gpu/drm/drm_buddy.c
> >>> +++ b/drivers/gpu/drm/drm_buddy.c
> >>> @@ -25,9 +25,11 @@ void drm_buddy_block_print(struct gpu_buddy
> >> *mm,
> >>>    			   struct gpu_buddy_block *block,
> >>>    			   struct drm_printer *p)
> >>>    {
> >>> -	u64 start = gpu_buddy_block_offset(block);
> >>> -	u64 size = gpu_buddy_block_size(mm, block);
> >>> +	u64 start, size;
> >>>
> >>> +	gpu_buddy_driver_lock_held(mm);
> >>> +	start = gpu_buddy_block_offset(block);
> >>> +	size = gpu_buddy_block_size(mm, block);
> >>
> >> Same here.
> >
> > Yes
> >
> >>
> >>>    	drm_printf(p, "%#018llx-%#018llx: %llu\n", start, start + size, size);
> >>>    }
> >>>    EXPORT_SYMBOL(drm_buddy_block_print);
> >>> @@ -42,6 +44,7 @@ void drm_buddy_print(struct gpu_buddy *mm,
> struct
> >> drm_printer *p)
> >>>    {
> >>>    	int order;
> >>>
> >>> +	gpu_buddy_driver_lock_held(mm);
> >>>    	drm_printf(p, "chunk_size: %lluKiB, total: %lluMiB, free:
> >>> %lluMiB,
> >> clear_free: %lluMiB\n",
> >>>    		   mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20,
> >>> mm->clear_avail >> 20);
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> >>> b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> >>> index 01a9b92772f8..935e589dd4b0 100644
> >>> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> >>> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> >>> @@ -293,7 +293,9 @@ static void xe_ttm_vram_mgr_fini(struct
> >> drm_device
> >>> *dev, void *arg)
> >>>
> >>>    	WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
> >>>
> >>> +	mutex_lock(&mgr->lock);
> >>>    	gpu_buddy_fini(&mgr->mm);
> >>> +	mutex_unlock(&mgr->lock);
> >>
> >> This shouldn't need a lock. Annotation for this one should also be dropped.
> >
> > The thought was that while we call buddy_fini() there could be in-flight access to the
> buddy manager, such as a new allocation in progress, and the lock would provide
> protection against that. Please let me know what you think!
> 
> The mm stuff is about to get nuked, so if something is still messing with this
> then we are in big trouble anyway, and grabbing the lock doesn't really help
> much. The fini() here should be triggered as we start tearing down the driver.

Right, the driver is going away at this stage, but the thinking was that it would at least allow in-flight users holding the lock to finish before fini() proceeds. I am not sure such a situation would be hit in practice, though, so I am ok to remove this.

Tejas
> 
> >
> > Tejas
> >>
> >>>
> >>>    	ttm_resource_manager_cleanup(&mgr->manager);
> >>>
> >>> @@ -328,6 +330,7 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe,
> >> struct xe_ttm_vram_mgr *mgr,
> >>>    	if (err)
> >>>    		return err;
> >>>
> >>> +	gpu_buddy_driver_set_lock(&mgr->mm, &mgr->lock);
> >>>    	ttm_set_driver_manager(&xe->ttm, mem_type, &mgr->manager);
> >>>    	ttm_resource_manager_set_used(&mgr->manager, true);
> >>>
> >>> diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
> >>> index 5fa917ba5450..c174de80ad72 100644
> >>> --- a/include/linux/gpu_buddy.h
> >>> +++ b/include/linux/gpu_buddy.h
> >>> @@ -154,6 +154,7 @@ struct gpu_buddy_block {
> >>>     * @avail: Total free space currently available for allocation in bytes.
> >>>     * @clear_avail: Free space available in the clear tree (zeroed memory) in
> >> bytes.
> >>>     *               This is a subset of @avail.
> >>> + * @lock_dep_map: Annotates gpu_buddy API with a driver provided
> lock.
> >>>     */
> >>>    struct gpu_buddy {
> >>>    /* private: */
> >>> @@ -179,8 +180,48 @@ struct gpu_buddy {
> >>>    	u64 size;
> >>>    	u64 avail;
> >>>    	u64 clear_avail;
> >>> +#ifdef CONFIG_LOCKDEP
> >>> +	struct lockdep_map *lock_dep_map;
> >>> +#endif
> >>>    };
> >>>
> >>> +#ifdef CONFIG_LOCKDEP
> >>> +/**
> >>> + * gpu_buddy_driver_set_lock() - Set the lock protecting accesses to
> >>> +GPU BUDDY
> >>> + * @mm: Pointer to GPU buddy structure.
> >>> + * @lock: the lock used to protect the gpu buddy. The locking
> >>> +primitive
> >>> + * must contain a dep_map field.
> >>> + *
> >>> + * Call this to annotate gpu_buddy APIs which access/modify gpu_buddy
> >>> +manager  */ #define gpu_buddy_driver_set_lock(mm, lock) \
> >>> +	do { \
> >>> +		struct gpu_buddy *__mm = (mm); \
> >>> +		if (!WARN(__mm->lock_dep_map, "GPU BUDDY MM lock
> >> should be set only once.")) \
> >>> +			__mm->lock_dep_map = &(lock)->dep_map; \
> >>> +	} while (0)
> >>> +#else
> >>> +#define gpu_buddy_driver_set_lock(mm, lock) do { (void)(mm);
> >>> +(void)(lock); } while (0) #endif
> >>> +
> >>> +#ifdef CONFIG_LOCKDEP
> >>> +/**
> >>> + * gpu_buddy_driver_lock_held() - Assert GPU BUDDY manager lock is
> >>> +held
> >>> + * @mm: Pointer to the GPU BUDDY structure.
> >>> + *
> >>> + * Ensure driver lock is held.
> >>> + */
> >>> +static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm) {
> >>> +	if ((mm)->lock_dep_map)
> >>> +		lockdep_assert(lock_is_held_type((mm)->lock_dep_map, 0));
> >> } #else
>>>>> +static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm) { }
> >>> +#endif
> >>> +
> >>>    static inline u64
> >>>    gpu_buddy_block_offset(const struct gpu_buddy_block *block)
> >>>    {
> >


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu buddy manager
  2026-04-16 10:04         ` Upadhyay, Tejas
@ 2026-04-16 10:15           ` Matthew Auld
  2026-04-16 10:18             ` Upadhyay, Tejas
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Auld @ 2026-04-16 10:15 UTC (permalink / raw)
  To: Upadhyay, Tejas, intel-xe@lists.freedesktop.org
  Cc: Brost, Matthew, thomas.hellstrom@linux.intel.com,
	Ghimiray, Himal Prasad

On 16/04/2026 11:04, Upadhyay, Tejas wrote:
> 
> 
>> -----Original Message-----
>> From: Auld, Matthew <matthew.auld@intel.com>
>> Sent: 16 April 2026 15:27
>> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
>> xe@lists.freedesktop.org
>> Cc: Brost, Matthew <matthew.brost@intel.com>;
>> thomas.hellstrom@linux.intel.com; Ghimiray, Himal Prasad
>> <himal.prasad.ghimiray@intel.com>
>> Subject: Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu
>> buddy manager
>>
>> On 16/04/2026 10:43, Upadhyay, Tejas wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Auld, Matthew <matthew.auld@intel.com>
>>>> Sent: 16 April 2026 14:25
>>>> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
>>>> xe@lists.freedesktop.org
>>>> Cc: Brost, Matthew <matthew.brost@intel.com>;
>>>> thomas.hellstrom@linux.intel.com; Ghimiray, Himal Prasad
>>>> <himal.prasad.ghimiray@intel.com>
>>>> Subject: Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for
>>>> gpu buddy manager
>>>>
>>>> On 16/04/2026 08:49, Tejas Upadhyay wrote:
>>>>> Integrating lockdep into the gpu_buddy manager as standard practice
>>>>> for verifying that internal resources are correctly protected by
>>>>> their associated locks.
>>>>>
>>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
>>>>> ---
>>>>>     drivers/gpu/buddy.c                  | 18 ++++++++++--
>>>>>     drivers/gpu/drm/drm_buddy.c          |  7 +++--
>>>>>     drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  3 ++
>>>>>     include/linux/gpu_buddy.h            | 41
>> ++++++++++++++++++++++++++++
>>>>>     4 files changed, 65 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c index
>>>>> 52686672e99f..53ff85ac2105 100644
>>>>> --- a/drivers/gpu/buddy.c
>>>>> +++ b/drivers/gpu/buddy.c
>>>>> @@ -437,6 +437,9 @@ int gpu_buddy_init(struct gpu_buddy *mm, u64
>>>> size, u64 chunk_size)
>>>>>     		root_count++;
>>>>>     	} while (size);
>>>>>
>>>>> +#ifdef CONFIG_LOCKDEP
>>>>> +	mm->lock_dep_map = NULL;
>>>>> +#endif
>>>>>     	return 0;
>>>>>
>>>>>     out_free_roots:
>>>>> @@ -464,6 +467,7 @@ void gpu_buddy_fini(struct gpu_buddy *mm)
>>>>>     	unsigned int order;
>>>>>     	int i;
>>>>>
>>>>> +	gpu_buddy_driver_lock_held(mm);
>>>>>     	size = mm->size;
>>>>>
>>>>>     	for (i = 0; i < mm->n_roots; ++i) { @@ -538,6 +542,7 @@ void
>>>>> gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear)
>>>>>     	unsigned int order;
>>>>>     	int i;
>>>>>
>>>>> +	gpu_buddy_driver_lock_held(mm);
>>>>>     	size = mm->size;
>>>>>     	for (i = 0; i < mm->n_roots; ++i) {
>>>>>     		order = ilog2(size) - ilog2(mm->chunk_size); @@ -580,6
>>>> +585,7 @@
>>>>> EXPORT_SYMBOL(gpu_buddy_reset_clear);
>>>>>     void gpu_buddy_free_block(struct gpu_buddy *mm,
>>>>>     			  struct gpu_buddy_block *block)
>>>>>     {
>>>>> +	gpu_buddy_driver_lock_held(mm);
>>>>>     	BUG_ON(!gpu_buddy_block_is_allocated(block));
>>>>>     	mm->avail += gpu_buddy_block_size(mm, block);
>>>>>     	if (gpu_buddy_block_is_clear(block)) @@ -633,6 +639,7 @@ void
>>>>> gpu_buddy_free_list(struct gpu_buddy *mm,
>>>>>     {
>>>>>     	bool mark_clear = flags & GPU_BUDDY_CLEARED;
>>>>>
>>>>> +	gpu_buddy_driver_lock_held(mm);
>>>>>     	__gpu_buddy_free_list(mm, objects, mark_clear, !mark_clear);
>>>>>     }
>>>>>     EXPORT_SYMBOL(gpu_buddy_free_list);
>>>>> @@ -1172,6 +1179,8 @@ int gpu_buddy_block_trim(struct gpu_buddy
>>>> *mm,
>>>>>     	u64 new_start;
>>>>>     	int err;
>>>>>
>>>>> +	gpu_buddy_driver_lock_held(mm);
>>>>> +
>>>>>     	if (!list_is_singular(blocks))
>>>>>     		return -EINVAL;
>>>>>
>>>>> @@ -1287,6 +1296,8 @@ int gpu_buddy_alloc_blocks(struct gpu_buddy
>>>> *mm,
>>>>>     	unsigned long pages;
>>>>>     	int err;
>>>>>
>>>>> +	gpu_buddy_driver_lock_held(mm);
>>>>> +
>>>>>     	if (size < mm->chunk_size)
>>>>>     		return -EINVAL;
>>>>>
>>>>> @@ -1458,9 +1469,11 @@ EXPORT_SYMBOL(gpu_buddy_alloc_blocks);
>>>>>     void gpu_buddy_block_print(struct gpu_buddy *mm,
>>>>>     			   struct gpu_buddy_block *block)
>>>>>     {
>>>>> -	u64 start = gpu_buddy_block_offset(block);
>>>>> -	u64 size = gpu_buddy_block_size(mm, block);
>>>>> +	u64 start, size;
>>>>>
>>>>> +	gpu_buddy_driver_lock_held(mm);
>>>>
>>>> I don't think we want this one. The mm interaction is just for
>>>> immutable state, and the block itself is essentially owned by the
>>>> caller. Same reason why we don't want annotations for stuff like
>>>> gpu_buddy_block_offset() etc.
>>>
>>> Yes, makes sense. I will change it.
>>>
>>>>
>>>>> +	start = gpu_buddy_block_offset(block);
>>>>> +	size = gpu_buddy_block_size(mm, block);
>>>>>     	pr_info("%#018llx-%#018llx: %llu\n", start, start + size, size);
>>>>>     }
>>>>>     EXPORT_SYMBOL(gpu_buddy_block_print);
>>>>> @@ -1475,6 +1488,7 @@ void gpu_buddy_print(struct gpu_buddy
>> *mm)
>>>>>     {
>>>>>     	int order;
>>>>>
>>>>> +	gpu_buddy_driver_lock_held(mm);
>>>>>     	pr_info("chunk_size: %lluKiB, total: %lluMiB, free: %lluMiB, clear_free:
>>>> %lluMiB\n",
>>>>>     		mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20,
>>>>> mm->clear_avail >> 20);
>>>>>
>>>>> diff --git a/drivers/gpu/drm/drm_buddy.c
>>>>> b/drivers/gpu/drm/drm_buddy.c index 841f3de5f307..f4ad09b8a36e
>>>>> 100644
>>>>> --- a/drivers/gpu/drm/drm_buddy.c
>>>>> +++ b/drivers/gpu/drm/drm_buddy.c
>>>>> @@ -25,9 +25,11 @@ void drm_buddy_block_print(struct gpu_buddy
>>>> *mm,
>>>>>     			   struct gpu_buddy_block *block,
>>>>>     			   struct drm_printer *p)
>>>>>     {
>>>>> -	u64 start = gpu_buddy_block_offset(block);
>>>>> -	u64 size = gpu_buddy_block_size(mm, block);
>>>>> +	u64 start, size;
>>>>>
>>>>> +	gpu_buddy_driver_lock_held(mm);
>>>>> +	start = gpu_buddy_block_offset(block);
>>>>> +	size = gpu_buddy_block_size(mm, block);
>>>>
>>>> Same here.
>>>
>>> Yes
>>>
>>>>
>>>>>     	drm_printf(p, "%#018llx-%#018llx: %llu\n", start, start + size, size);
>>>>>     }
>>>>>     EXPORT_SYMBOL(drm_buddy_block_print);
>>>>> @@ -42,6 +44,7 @@ void drm_buddy_print(struct gpu_buddy *mm,
>> struct
>>>> drm_printer *p)
>>>>>     {
>>>>>     	int order;
>>>>>
>>>>> +	gpu_buddy_driver_lock_held(mm);
>>>>>     	drm_printf(p, "chunk_size: %lluKiB, total: %lluMiB, free:
>>>>> %lluMiB,
>>>> clear_free: %lluMiB\n",
>>>>>     		   mm->chunk_size >> 10, mm->size >> 20, mm->avail >> 20,
>>>>> mm->clear_avail >> 20);
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>>>>> b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>>>>> index 01a9b92772f8..935e589dd4b0 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>>>>> @@ -293,7 +293,9 @@ static void xe_ttm_vram_mgr_fini(struct
>>>> drm_device
>>>>> *dev, void *arg)
>>>>>
>>>>>     	WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
>>>>>
>>>>> +	mutex_lock(&mgr->lock);
>>>>>     	gpu_buddy_fini(&mgr->mm);
>>>>> +	mutex_unlock(&mgr->lock);
>>>>
>>>> This shouldn't need a lock. Annotation for this one should also be dropped.
>>>
>>> The thought was that while we call buddy_fini() there could be in-flight access to the
>> buddy manager, such as a new allocation in progress, and the lock would provide
>> protection against that. Please let me know what you think!
>>
>> The mm stuff is about to get nuked, so if something is still messing with this
>> then we are in big trouble anyway, and grabbing the lock doesn't really help
>> much. The fini() here should be triggered as we start tearing down the driver.
> 
> Right, the driver is going away at this stage, but the thinking was that it would at least allow in-flight users holding the lock to finish before fini() proceeds. I am not sure such a situation would be hit in practice, though, so I am ok to remove this.

Right, but grabbing the lock won't help much here, I think. As soon as 
you drop it here, the other place probably still goes boom anyway, since 
the mm state is all gone.

If this is a concern, I think it would need to be solved in a different 
way. We do have various WARN_ON() if we detect still in-use blocks, but 
essentially the bug is not here, and likely someone just forgot some 
put() somewhere.

> 
> Tejas
>>
>>>
>>> Tejas
>>>>
>>>>>
>>>>>     	ttm_resource_manager_cleanup(&mgr->manager);
>>>>>
>>>>> @@ -328,6 +330,7 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe,
>>>> struct xe_ttm_vram_mgr *mgr,
>>>>>     	if (err)
>>>>>     		return err;
>>>>>
>>>>> +	gpu_buddy_driver_set_lock(&mgr->mm, &mgr->lock);
>>>>>     	ttm_set_driver_manager(&xe->ttm, mem_type, &mgr->manager);
>>>>>     	ttm_resource_manager_set_used(&mgr->manager, true);
>>>>>
>>>>> diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
>>>>> index 5fa917ba5450..c174de80ad72 100644
>>>>> --- a/include/linux/gpu_buddy.h
>>>>> +++ b/include/linux/gpu_buddy.h
>>>>> @@ -154,6 +154,7 @@ struct gpu_buddy_block {
>>>>>      * @avail: Total free space currently available for allocation in bytes.
>>>>>      * @clear_avail: Free space available in the clear tree (zeroed memory) in
>>>> bytes.
>>>>>      *               This is a subset of @avail.
>>>>> + * @lock_dep_map: Annotates gpu_buddy API with a driver provided
>> lock.
>>>>>      */
>>>>>     struct gpu_buddy {
>>>>>     /* private: */
>>>>> @@ -179,8 +180,48 @@ struct gpu_buddy {
>>>>>     	u64 size;
>>>>>     	u64 avail;
>>>>>     	u64 clear_avail;
>>>>> +#ifdef CONFIG_LOCKDEP
>>>>> +	struct lockdep_map *lock_dep_map;
>>>>> +#endif
>>>>>     };
>>>>>
>>>>> +#ifdef CONFIG_LOCKDEP
>>>>> +/**
>>>>> + * gpu_buddy_driver_set_lock() - Set the lock protecting accesses to
>>>>> +GPU BUDDY
>>>>> + * @mm: Pointer to GPU buddy structure.
>>>>> + * @lock: the lock used to protect the gpu buddy. The locking
>>>>> +primitive
>>>>> + * must contain a dep_map field.
>>>>> + *
>>>>> + * Call this to annotate gpu_buddy APIs which access/modify gpu_buddy
>>>>> +manager  */ #define gpu_buddy_driver_set_lock(mm, lock) \
>>>>> +	do { \
>>>>> +		struct gpu_buddy *__mm = (mm); \
>>>>> +		if (!WARN(__mm->lock_dep_map, "GPU BUDDY MM lock
>>>> should be set only once.")) \
>>>>> +			__mm->lock_dep_map = &(lock)->dep_map; \
>>>>> +	} while (0)
>>>>> +#else
>>>>> +#define gpu_buddy_driver_set_lock(mm, lock) do { (void)(mm);
>>>>> +(void)(lock); } while (0) #endif
>>>>> +
>>>>> +#ifdef CONFIG_LOCKDEP
>>>>> +/**
>>>>> + * gpu_buddy_driver_lock_held() - Assert GPU BUDDY manager lock is
>>>>> +held
>>>>> + * @mm: Pointer to the GPU BUDDY structure.
>>>>> + *
>>>>> + * Ensure driver lock is held.
>>>>> + */
>>>>> +static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm) {
>>>>> +	if ((mm)->lock_dep_map)
>>>>> +		lockdep_assert(lock_is_held_type((mm)->lock_dep_map, 0));
>>>> } #else
> >>>>> +static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm) { }
>>>>> +#endif
>>>>> +
>>>>>     static inline u64
>>>>>     gpu_buddy_block_offset(const struct gpu_buddy_block *block)
>>>>>     {
>>>
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu buddy manager
  2026-04-16 10:15           ` Matthew Auld
@ 2026-04-16 10:18             ` Upadhyay, Tejas
  0 siblings, 0 replies; 19+ messages in thread
From: Upadhyay, Tejas @ 2026-04-16 10:18 UTC (permalink / raw)
  To: Auld, Matthew, intel-xe@lists.freedesktop.org
  Cc: Brost, Matthew, thomas.hellstrom@linux.intel.com,
	Ghimiray, Himal Prasad



> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: 16 April 2026 15:46
> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> xe@lists.freedesktop.org
> Cc: Brost, Matthew <matthew.brost@intel.com>;
> thomas.hellstrom@linux.intel.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu
> buddy manager
> 
> On 16/04/2026 11:04, Upadhyay, Tejas wrote:
> >
> >
> >> -----Original Message-----
> >> From: Auld, Matthew <matthew.auld@intel.com>
> >> Sent: 16 April 2026 15:27
> >> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> >> xe@lists.freedesktop.org
> >> Cc: Brost, Matthew <matthew.brost@intel.com>;
> >> thomas.hellstrom@linux.intel.com; Ghimiray, Himal Prasad
> >> <himal.prasad.ghimiray@intel.com>
> >> Subject: Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for
> >> gpu buddy manager
> >>
> >> On 16/04/2026 10:43, Upadhyay, Tejas wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Auld, Matthew <matthew.auld@intel.com>
> >>>> Sent: 16 April 2026 14:25
> >>>> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> >>>> xe@lists.freedesktop.org
> >>>> Cc: Brost, Matthew <matthew.brost@intel.com>;
> >>>> thomas.hellstrom@linux.intel.com; Ghimiray, Himal Prasad
> >>>> <himal.prasad.ghimiray@intel.com>
> >>>> Subject: Re: [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for
> >>>> gpu buddy manager
> >>>>
> >>>> On 16/04/2026 08:49, Tejas Upadhyay wrote:
> >>>>> Integrating lockdep into the gpu_buddy manager as standard
> >>>>> practice for verifying that internal resources are correctly
> >>>>> protected by their associated locks.
> >>>>>
> >>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> >>>>> ---
> >>>>>     drivers/gpu/buddy.c                  | 18 ++++++++++--
> >>>>>     drivers/gpu/drm/drm_buddy.c          |  7 +++--
> >>>>>     drivers/gpu/drm/xe/xe_ttm_vram_mgr.c |  3 ++
> >>>>>     include/linux/gpu_buddy.h            | 41
> >> ++++++++++++++++++++++++++++
> >>>>>     4 files changed, 65 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c index
> >>>>> 52686672e99f..53ff85ac2105 100644
> >>>>> --- a/drivers/gpu/buddy.c
> >>>>> +++ b/drivers/gpu/buddy.c
> >>>>> @@ -437,6 +437,9 @@ int gpu_buddy_init(struct gpu_buddy *mm,
> u64
> >>>> size, u64 chunk_size)
> >>>>>     		root_count++;
> >>>>>     	} while (size);
> >>>>>
> >>>>> +#ifdef CONFIG_LOCKDEP
> >>>>> +	mm->lock_dep_map = NULL;
> >>>>> +#endif
> >>>>>     	return 0;
> >>>>>
> >>>>>     out_free_roots:
> >>>>> @@ -464,6 +467,7 @@ void gpu_buddy_fini(struct gpu_buddy *mm)
> >>>>>     	unsigned int order;
> >>>>>     	int i;
> >>>>>
> >>>>> +	gpu_buddy_driver_lock_held(mm);
> >>>>>     	size = mm->size;
> >>>>>
> >>>>>     	for (i = 0; i < mm->n_roots; ++i) { @@ -538,6 +542,7 @@ void
> >>>>> gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear)
> >>>>>     	unsigned int order;
> >>>>>     	int i;
> >>>>>
> >>>>> +	gpu_buddy_driver_lock_held(mm);
> >>>>>     	size = mm->size;
> >>>>>     	for (i = 0; i < mm->n_roots; ++i) {
> >>>>>     		order = ilog2(size) - ilog2(mm->chunk_size); @@ -
> 580,6
> >>>> +585,7 @@
> >>>>> EXPORT_SYMBOL(gpu_buddy_reset_clear);
> >>>>>     void gpu_buddy_free_block(struct gpu_buddy *mm,
> >>>>>     			  struct gpu_buddy_block *block)
> >>>>>     {
> >>>>> +	gpu_buddy_driver_lock_held(mm);
> >>>>>     	BUG_ON(!gpu_buddy_block_is_allocated(block));
> >>>>>     	mm->avail += gpu_buddy_block_size(mm, block);
> >>>>>     	if (gpu_buddy_block_is_clear(block)) @@ -633,6 +639,7 @@
> void
> >>>>> gpu_buddy_free_list(struct gpu_buddy *mm,
> >>>>>     {
> >>>>>     	bool mark_clear = flags & GPU_BUDDY_CLEARED;
> >>>>>
> >>>>> +	gpu_buddy_driver_lock_held(mm);
> >>>>>     	__gpu_buddy_free_list(mm, objects, mark_clear,
> !mark_clear);
> >>>>>     }
> >>>>>     EXPORT_SYMBOL(gpu_buddy_free_list);
> >>>>> @@ -1172,6 +1179,8 @@ int gpu_buddy_block_trim(struct
> gpu_buddy
> >>>> *mm,
> >>>>>     	u64 new_start;
> >>>>>     	int err;
> >>>>>
> >>>>> +	gpu_buddy_driver_lock_held(mm);
> >>>>> +
> >>>>>     	if (!list_is_singular(blocks))
> >>>>>     		return -EINVAL;
> >>>>>
> >>>>> @@ -1287,6 +1296,8 @@ int gpu_buddy_alloc_blocks(struct
> gpu_buddy
> >>>> *mm,
> >>>>>     	unsigned long pages;
> >>>>>     	int err;
> >>>>>
> >>>>> +	gpu_buddy_driver_lock_held(mm);
> >>>>> +
> >>>>>     	if (size < mm->chunk_size)
> >>>>>     		return -EINVAL;
> >>>>>
> >>>>> @@ -1458,9 +1469,11 @@
> EXPORT_SYMBOL(gpu_buddy_alloc_blocks);
> >>>>>     void gpu_buddy_block_print(struct gpu_buddy *mm,
> >>>>>     			   struct gpu_buddy_block *block)
> >>>>>     {
> >>>>> -	u64 start = gpu_buddy_block_offset(block);
> >>>>> -	u64 size = gpu_buddy_block_size(mm, block);
> >>>>> +	u64 start, size;
> >>>>>
> >>>>> +	gpu_buddy_driver_lock_held(mm);
> >>>>
> >>>> I don't think we want this one. The mm interaction is just for
> >>>> immutable state, and the block itself is essentially owned by the
> >>>> caller. Same reason why we don't want annotations for stuff like
> >>>> gpu_buddy_block_offset() etc.
> >>>
> >>> Yes, makes sense. I will change it.
> >>>
> >>>>
> >>>>> +	start = gpu_buddy_block_offset(block);
> >>>>> +	size = gpu_buddy_block_size(mm, block);
> >>>>>     	pr_info("%#018llx-%#018llx: %llu\n", start, start + size, size);
> >>>>>     }
> >>>>>     EXPORT_SYMBOL(gpu_buddy_block_print);
> >>>>> @@ -1475,6 +1488,7 @@ void gpu_buddy_print(struct gpu_buddy
> >> *mm)
> >>>>>     {
> >>>>>     	int order;
> >>>>>
> >>>>> +	gpu_buddy_driver_lock_held(mm);
> >>>>>     	pr_info("chunk_size: %lluKiB, total: %lluMiB, free: %lluMiB,
> clear_free:
> >>>> %lluMiB\n",
> >>>>>     		mm->chunk_size >> 10, mm->size >> 20, mm->avail >>
> 20,
> >>>>> mm->clear_avail >> 20);
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/drm_buddy.c
> >>>>> b/drivers/gpu/drm/drm_buddy.c index 841f3de5f307..f4ad09b8a36e
> >>>>> 100644
> >>>>> --- a/drivers/gpu/drm/drm_buddy.c
> >>>>> +++ b/drivers/gpu/drm/drm_buddy.c
> >>>>> @@ -25,9 +25,11 @@ void drm_buddy_block_print(struct gpu_buddy
> >>>> *mm,
> >>>>>     			   struct gpu_buddy_block *block,
> >>>>>     			   struct drm_printer *p)
> >>>>>     {
> >>>>> -	u64 start = gpu_buddy_block_offset(block);
> >>>>> -	u64 size = gpu_buddy_block_size(mm, block);
> >>>>> +	u64 start, size;
> >>>>>
> >>>>> +	gpu_buddy_driver_lock_held(mm);
> >>>>> +	start = gpu_buddy_block_offset(block);
> >>>>> +	size = gpu_buddy_block_size(mm, block);
> >>>>
> >>>> Same here.
> >>>
> >>> Yes
> >>>
> >>>>
> >>>>>     	drm_printf(p, "%#018llx-%#018llx: %llu\n", start, start + size,
> size);
> >>>>>     }
> >>>>>     EXPORT_SYMBOL(drm_buddy_block_print);
> >>>>> @@ -42,6 +44,7 @@ void drm_buddy_print(struct gpu_buddy *mm,
> >> struct
> >>>> drm_printer *p)
> >>>>>     {
> >>>>>     	int order;
> >>>>>
> >>>>> +	gpu_buddy_driver_lock_held(mm);
> >>>>>     	drm_printf(p, "chunk_size: %lluKiB, total: %lluMiB, free:
> >>>>> %lluMiB,
> >>>> clear_free: %lluMiB\n",
> >>>>>     		   mm->chunk_size >> 10, mm->size >> 20, mm->avail
> >> 20,
> >>>>> mm->clear_avail >> 20);
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> >>>>> b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> >>>>> index 01a9b92772f8..935e589dd4b0 100644
> >>>>> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> >>>>> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> >>>>> @@ -293,7 +293,9 @@ static void xe_ttm_vram_mgr_fini(struct
> >>>> drm_device
> >>>>> *dev, void *arg)
> >>>>>
> >>>>>     	WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
> >>>>>
> >>>>> +	mutex_lock(&mgr->lock);
> >>>>>     	gpu_buddy_fini(&mgr->mm);
> >>>>> +	mutex_unlock(&mgr->lock);
> >>>>
> >>>> This shouldn't need a lock. Annotation for this one should also be
> dropped.
> >>>
> >>> The thought was that while we call buddy_fini() there could be in-flight
> >>> access to the
> >> buddy manager, such as a new allocation in progress, and the lock would
> >> provide protection against that. Please let me know what you think!
> >>
> >> The mm stuff is about to get nuked, so if something is still messing
> >> with this then we are in big trouble anyway, and grabbing the lock
> >> doesn't really help much. The fini() here should be triggered as we
> >> start tearing down the driver.
> >
> > Right, the driver is going away at this stage, but the thinking was
> > that at least it would allow in-flight users who hold the lock to
> > finish before fini() proceeds. I am not sure, though, whether such a
> > situation would be hit in practice. I am ok to remove this.
> 
> Right, but grabbing the lock won't help much here, I think. As soon as
> you drop it here, the other place probably still goes boom anyway, since
> the mm state is all gone.
> 
> If this is a concern, I think it would need to be solved in a different
> way. We do have various WARN_ON() if we detect still in-use blocks, but
> essentially the bug is not here; likely someone just forgot a put()
> somewhere.

Ok, I will remove this locking and its annotation.

Tejas
> 
> >
> > Tejas
> >>
> >>>
> >>> Tejas
> >>>>
> >>>>>
> >>>>>     	ttm_resource_manager_cleanup(&mgr->manager);
> >>>>>
> >>>>> @@ -328,6 +330,7 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
> >>>>>     	if (err)
> >>>>>     		return err;
> >>>>>
> >>>>> +	gpu_buddy_driver_set_lock(&mgr->mm, &mgr->lock);
> >>>>>     	ttm_set_driver_manager(&xe->ttm, mem_type, &mgr->manager);
> >>>>>     	ttm_resource_manager_set_used(&mgr->manager, true);
> >>>>>
> >>>>> diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
> >>>>> index 5fa917ba5450..c174de80ad72 100644
> >>>>> --- a/include/linux/gpu_buddy.h
> >>>>> +++ b/include/linux/gpu_buddy.h
> >>>>> @@ -154,6 +154,7 @@ struct gpu_buddy_block {
> >>>>>      * @avail: Total free space currently available for allocation in bytes.
> >>>>>      * @clear_avail: Free space available in the clear tree (zeroed
> >>>>>      *               memory) in bytes. This is a subset of @avail.
> >>>>> + * @lock_dep_map: Annotates gpu_buddy API with a driver-provided lock.
> >>>>>      */
> >>>>>     struct gpu_buddy {
> >>>>>     /* private: */
> >>>>> @@ -179,8 +180,48 @@ struct gpu_buddy {
> >>>>>     	u64 size;
> >>>>>     	u64 avail;
> >>>>>     	u64 clear_avail;
> >>>>> +#ifdef CONFIG_LOCKDEP
> >>>>> +	struct lockdep_map *lock_dep_map;
> >>>>> +#endif
> >>>>>     };
> >>>>>
> >>>>> +#ifdef CONFIG_LOCKDEP
> >>>>> +/**
> >>>>> + * gpu_buddy_driver_set_lock() - Set the lock protecting accesses to
> >>>>> + * GPU BUDDY
> >>>>> + * @mm: Pointer to GPU buddy structure.
> >>>>> + * @lock: the lock used to protect the gpu buddy. The locking
> >>>>> + * primitive must contain a dep_map field.
> >>>>> + *
> >>>>> + * Call this to annotate gpu_buddy APIs which access/modify the
> >>>>> + * gpu_buddy manager.
> >>>>> + */
> >>>>> +#define gpu_buddy_driver_set_lock(mm, lock) \
> >>>>> +	do { \
> >>>>> +		struct gpu_buddy *__mm = (mm); \
> >>>>> +		if (!WARN(__mm->lock_dep_map, "GPU BUDDY MM lock should be set only once.")) \
> >>>>> +			__mm->lock_dep_map = &(lock)->dep_map; \
> >>>>> +	} while (0)
> >>>>> +#else
> >>>>> +#define gpu_buddy_driver_set_lock(mm, lock) do { (void)(mm); (void)(lock); } while (0)
> >>>>> +#endif
> >>>>> +
> >>>>> +#ifdef CONFIG_LOCKDEP
> >>>>> +/**
> >>>>> + * gpu_buddy_driver_lock_held() - Assert GPU BUDDY manager lock is held
> >>>>> + * @mm: Pointer to the GPU BUDDY structure.
> >>>>> + *
> >>>>> + * Ensure driver lock is held.
> >>>>> + */
> >>>>> +static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm)
> >>>>> +{
> >>>>> +	if ((mm)->lock_dep_map)
> >>>>> +		lockdep_assert(lock_is_held_type((mm)->lock_dep_map, 0));
> >>>>> +}
> >>>>> +#else
> >>>>> +static inline void gpu_buddy_driver_lock_held(struct gpu_buddy *mm) { }
> >>>>> +#endif
> >>>>> +
> >>>>>     static inline u64
> >>>>>     gpu_buddy_block_offset(const struct gpu_buddy_block *block)
> >>>>>     {
> >>>
> >



end of thread, other threads:[~2026-04-16 10:19 UTC | newest]

Thread overview: 19+ messages
2026-04-16  7:49 [RFC PATCH V7 00/10] Add memory page offlining support Tejas Upadhyay
2026-04-16  7:49 ` [RFC PATCH V7 01/10] drm/xe: Link VRAM object with gpu buddy Tejas Upadhyay
2026-04-16  7:49 ` [RFC PATCH V7 02/10] gpu/buddy: Integrate lockdep for gpu buddy manager Tejas Upadhyay
2026-04-16  8:55   ` Matthew Auld
2026-04-16  9:43     ` Upadhyay, Tejas
2026-04-16  9:56       ` Matthew Auld
2026-04-16 10:04         ` Upadhyay, Tejas
2026-04-16 10:15           ` Matthew Auld
2026-04-16 10:18             ` Upadhyay, Tejas
2026-04-16  7:49 ` [RFC PATCH V7 03/10] drm/gpu: Add gpu_buddy_allocated_addr_to_block helper Tejas Upadhyay
2026-04-16  7:49 ` [RFC PATCH V7 04/10] drm/xe: Link LRC BO and its execution Queue Tejas Upadhyay
2026-04-16  7:49 ` [RFC PATCH V7 05/10] drm/xe: Extend BO purge to handle vram pages as well Tejas Upadhyay
2026-04-16  7:49 ` [RFC PATCH V7 06/10] drm/xe: Handle physical memory address error Tejas Upadhyay
2026-04-16  7:49 ` [RFC PATCH V7 07/10] drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
2026-04-16  7:49 ` [RFC PATCH V7 08/10] gpu/buddy: Add routine to dump allocated buddy blocks Tejas Upadhyay
2026-04-16  7:49 ` [RFC PATCH V7 09/10] drm/xe/configfs: Add vram bad page reservation policy Tejas Upadhyay
2026-04-16  7:49 ` [RFC PATCH V7 10/10] drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
2026-04-16  7:56 ` ✗ CI.checkpatch: warning for Add memory page offlining support (rev8) Patchwork
2026-04-16  7:57 ` ✗ CI.KUnit: failure " Patchwork
