Intel-XE Archive on lore.kernel.org
* [RFC PATCH 0/6] Add memory page offlining support
@ 2026-02-13  9:25 Tejas Upadhyay
  2026-02-13  9:25 ` [RFC PATCH 1/6] drm/xe/svm: Use res_to_mem_region Tejas Upadhyay
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Tejas Upadhyay @ 2026-02-13  9:25 UTC (permalink / raw)
  To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
	Tejas Upadhyay

This functionality represents a significant step in making
the xe driver gracefully handle hardware memory degradation.
By integrating with the DRM Buddy allocator, the driver
can permanently "carve out" faulty memory so it isn't reused
by subsequent allocations.
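
The core of the carve-out is a buddy range allocation that is never
freed. A minimal sketch of the idea (assuming a 4K page size; error
handling omitted, see patch 3 for the real implementation):

	LIST_HEAD(bad_blocks);

	/* Pin the 4K page containing the faulty address so the
	 * allocator can never hand it out again.
	 */
	err = drm_buddy_alloc_blocks(mm, addr, addr + SZ_4K,
				     SZ_4K, SZ_4K, &bad_blocks,
				     DRM_BUDDY_RANGE_ALLOCATION);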

This series adds memory page offlining support with the following:
1. drm/xe/svm: Use res_to_mem_region, avoid block->private usage
2. Link and track ttm BOs with physical addresses
3. Handle the reported physical address error by reserving the containing 4K page
4. Add supporting debugfs to inject a manual physical address error
5. Add buddy block allocation dump for debugging buddy related issues
6. Sysfs entry to provide statistics of bad gpu vram pages for user info


Opens:
1. The dump_allocated_blocks() and xe_ttm_vram_addr_to_tbo() APIs will move under
drm_buddy; right now they are part of the xe code just to showcase the concept

V3: use res_to_mem_region to avoid use of block->private (MattA)
V2:
- some fixes and clean up on errors
- Added xe_vram_addr_to_region helper to avoid other use of block->private (MattB)

Tejas Upadhyay (6):
  drm/xe/svm: Use res_to_mem_region
  drm/xe: Implement VRAM object tracking ability using physical address
  drm/xe: Handle physical memory address error
  [DO NOT REVIEW]drm/xe/cri: Add debugfs to inject faulty vram address
  drm/xe: Add routine to dump allocated VRAM blocks
  [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages

 drivers/gpu/drm/xe/xe_bo.c                 |   2 +-
 drivers/gpu/drm/xe/xe_bo.h                 |   1 +
 drivers/gpu/drm/xe/xe_debugfs.c            |  49 +++
 drivers/gpu/drm/xe/xe_device_sysfs.c       |   2 +
 drivers/gpu/drm/xe/xe_svm.c                |   8 +-
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c       | 355 +++++++++++++++++++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.h       |   6 +-
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |  23 ++
 8 files changed, 437 insertions(+), 9 deletions(-)

-- 
2.52.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [RFC PATCH 1/6] drm/xe/svm: Use res_to_mem_region
  2026-02-13  9:25 [RFC PATCH 0/6] Add memory page offlining support Tejas Upadhyay
@ 2026-02-13  9:25 ` Tejas Upadhyay
  2026-02-24  2:16   ` Matthew Brost
  2026-02-13  9:25 ` [RFC PATCH 2/6] drm/xe: Implement VRAM object tracking ability using physical address Tejas Upadhyay
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Tejas Upadhyay @ 2026-02-13  9:25 UTC (permalink / raw)
  To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
	Tejas Upadhyay

Replace the direct use of block->private with the helper function
res_to_mem_region to get the VRAM region.

V2(MattA): Use res_to_mem_region

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c  | 2 +-
 drivers/gpu/drm/xe/xe_bo.h  | 1 +
 drivers/gpu/drm/xe/xe_svm.c | 8 +-------
 3 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index cb8a177ec02b..70aca621c1a1 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -173,7 +173,7 @@ mem_type_to_migrate(struct xe_device *xe, u32 mem_type)
 	return tile->migrate;
 }
 
-static struct xe_vram_region *res_to_mem_region(struct ttm_resource *res)
+struct xe_vram_region *res_to_mem_region(struct ttm_resource *res)
 {
 	struct xe_device *xe = ttm_to_xe_device(res->bo->bdev);
 	struct ttm_resource_manager *mgr;
diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
index c914ab719f20..393f1b4faf99 100644
--- a/drivers/gpu/drm/xe/xe_bo.h
+++ b/drivers/gpu/drm/xe/xe_bo.h
@@ -311,6 +311,7 @@ int xe_bo_dumb_create(struct drm_file *file_priv,
 		      struct drm_mode_create_dumb *args);
 
 bool xe_bo_needs_ccs_pages(struct xe_bo *bo);
+struct xe_vram_region *res_to_mem_region(struct ttm_resource *res);
 
 static inline size_t xe_bo_ccs_pages_start(struct xe_bo *bo)
 {
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 213f0334518a..8015eb6fcbc9 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -762,7 +762,7 @@ static int xe_svm_populate_devmem_pfn(struct drm_pagemap_devmem *devmem_allocati
 	int j = 0;
 
 	list_for_each_entry(block, blocks, link) {
-		struct xe_vram_region *vr = block->private;
+		struct xe_vram_region *vr = res_to_mem_region(res);
 		struct drm_buddy *buddy = vram_to_buddy(vr);
 		u64 block_pfn = block_offset_to_pfn(devmem_allocation->dpagemap,
 						    drm_buddy_block_offset(block));
@@ -1033,9 +1033,7 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
 	struct dma_fence *pre_migrate_fence = NULL;
 	struct xe_device *xe = vr->xe;
 	struct device *dev = xe->drm.dev;
-	struct drm_buddy_block *block;
 	struct xe_validation_ctx vctx;
-	struct list_head *blocks;
 	struct drm_exec exec;
 	struct xe_bo *bo;
 	int err = 0, idx;
@@ -1072,10 +1070,6 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
 					&dpagemap_devmem_ops, dpagemap, end - start,
 					pre_migrate_fence);
 
-		blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
-		list_for_each_entry(block, blocks, link)
-			block->private = vr;
-
 		xe_bo_get(bo);
 
 		/* Ensure the device has a pm ref while there are device pages active. */
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 2/6] drm/xe: Implement VRAM object tracking ability using physical address
  2026-02-13  9:25 [RFC PATCH 0/6] Add memory page offlining support Tejas Upadhyay
  2026-02-13  9:25 ` [RFC PATCH 1/6] drm/xe/svm: Use res_to_mem_region Tejas Upadhyay
@ 2026-02-13  9:25 ` Tejas Upadhyay
  2026-02-13  9:25 ` [RFC PATCH 3/6] drm/xe: Handle physical memory address error Tejas Upadhyay
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Tejas Upadhyay @ 2026-02-13  9:25 UTC (permalink / raw)
  To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
	Tejas Upadhyay

Implement the capability to track and identify the TTM buffer object
backing a specific faulty memory address in VRAM. This functionality
is critical for supporting the memory page offline feature on CRI,
where an identified faulty page must be traced back to its
originating buffer for safe removal.
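
The patch has two halves, both shown in full in the diff below: stash
the owning BO in each allocated block at allocation time, then walk
the buddy tree at fault time. A rough sketch (names as in this patch):

	/* allocation path: remember which BO owns each block */
	list_for_each_entry(block, &vres->blocks, link)
		block->private = tbo;

	/* fault path: DFS the buddy tree back to the owning BO */
	tbo = xe_ttm_vram_addr_to_tbo(&vram_mgr->mm, addr);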

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 75 ++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  2 +-
 2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index d6aa61e55f4d..4e852eed5170 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -56,6 +56,7 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
 	u64 size, min_page_size;
 	unsigned long lpfn;
 	int err;
+	struct drm_buddy_block *block;
 
 	lpfn = place->lpfn;
 	if (!lpfn || lpfn > man->size >> PAGE_SHIFT)
@@ -137,6 +138,8 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
 	}
 
 	mgr->visible_avail -= vres->used_visible_size;
+	list_for_each_entry(block, &vres->blocks, link)
+		block->private = tbo;
 	mutex_unlock(&mgr->lock);
 
 	if (!(vres->base.placement & TTM_PL_FLAG_CONTIGUOUS) &&
@@ -467,3 +470,75 @@ u64 xe_ttm_vram_get_avail(struct ttm_resource_manager *man)
 
 	return avail;
 }
+
+static inline bool overlaps(u64 s1, u64 e1, u64 s2, u64 e2)
+{
+	return s1 <= e2 && e1 >= s2;
+}
+
+static inline bool contains(u64 s1, u64 e1, u64 s2, u64 e2)
+{
+	return s1 <= s2 && e1 >= e2;
+}
+
+static struct ttm_buffer_object *xe_ttm_vram_addr_to_tbo(struct drm_buddy *mm, u64 start)
+{
+	struct drm_buddy_block *block;
+	u64 end;
+	LIST_HEAD(dfs);
+	int i;
+
+	end = start + SZ_4K - 1;
+	for (i = 0; i < mm->n_roots; ++i)
+		list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+	do {
+		u64 block_start;
+		u64 block_end;
+
+		block = list_first_entry_or_null(&dfs,
+						 struct drm_buddy_block,
+						 tmp_link);
+		if (!block)
+			break;
+
+		list_del(&block->tmp_link);
+
+		block_start = drm_buddy_block_offset(block);
+		block_end = block_start + drm_buddy_block_size(mm, block) - 1;
+
+		if (!overlaps(start, end, block_start, block_end))
+			continue;
+
+		if (contains(block_start, block_end, start, end) &&
+		    !drm_buddy_block_is_split(block)) {
+			if (drm_buddy_block_is_free(block)) {
+				return NULL;
+			} else if (drm_buddy_block_is_allocated(block) && !mm->clear_avail) {
+				struct ttm_buffer_object *tbo = block->private;
+
+				WARN_ON(!tbo);
+				return tbo;
+			}
+		}
+
+		if (drm_buddy_block_is_split(block)) {
+			list_add(&block->right->tmp_link, &dfs);
+			list_add(&block->left->tmp_link, &dfs);
+		}
+	} while (1);
+
+	return NULL;
+}
+
+int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
+{
+	struct xe_ttm_vram_mgr *vram_mgr = &tile->mem.vram->ttm;
+	struct drm_buddy mm = vram_mgr->mm;
+	struct ttm_buffer_object *tbo;
+
+	tbo = xe_ttm_vram_addr_to_tbo(&mm, addr);
+
+	return 0;
+}
+EXPORT_SYMBOL(xe_ttm_tbo_handle_addr_fault);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
index 87b7fae5edba..1d6075411ebf 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
@@ -30,7 +30,7 @@ u64 xe_ttm_vram_get_avail(struct ttm_resource_manager *man);
 u64 xe_ttm_vram_get_cpu_visible_size(struct ttm_resource_manager *man);
 void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
 			  u64 *used, u64 *used_visible);
-
+int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr);
 static inline struct xe_ttm_vram_mgr_resource *
 to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
 {
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 3/6] drm/xe: Handle physical memory address error
  2026-02-13  9:25 [RFC PATCH 0/6] Add memory page offlining support Tejas Upadhyay
  2026-02-13  9:25 ` [RFC PATCH 1/6] drm/xe/svm: Use res_to_mem_region Tejas Upadhyay
  2026-02-13  9:25 ` [RFC PATCH 2/6] drm/xe: Implement VRAM object tracking ability using physical address Tejas Upadhyay
@ 2026-02-13  9:25 ` Tejas Upadhyay
  2026-02-13  9:25 ` [RFC PATCH 4/6] [DO NOT REVIEW]drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Tejas Upadhyay @ 2026-02-13  9:25 UTC (permalink / raw)
  To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
	Tejas Upadhyay

This functionality represents a significant step in making
the xe driver gracefully handle hardware memory degradation.
By integrating with the DRM Buddy allocator, the driver
can permanently "carve out" faulty memory so it isn't reused
by subsequent allocations.

Buddy Block Reservation:
----------------------
When a memory address is reported as faulty, the driver instructs
the DRM Buddy allocator to reserve a block of the specific page
size (typically 4KB). This marks the memory as "dirty/used"
indefinitely.

Two-Stage Tracking:
-----------------
Offlined Pages:
Pages that have been successfully isolated and removed from the
available memory pool.

Queued Pages:
Addresses that have been flagged as faulty but are currently in
use by a process. These are tracked until the associated buffer
object (BO) is released or migrated, at which point they move
to the "offlined" state.

Sysfs Reporting:
--------------
The patch exposes these metrics through a standard interface,
allowing administrators to monitor VRAM health:
/sys/bus/pci/devices/<device_id>/vram_bad_pages

V2:
- Fix mm->avail counter issue
- Remove unused code and handle clean up in case of error

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c       | 184 ++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |  21 +++
 2 files changed, 199 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 4e852eed5170..82d11348e5ce 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -276,6 +276,26 @@ static const struct ttm_resource_manager_func xe_ttm_vram_mgr_func = {
 	.debug	= xe_ttm_vram_mgr_debug
 };
 
+static void xe_ttm_vram_free_bad_pages(struct drm_device *dev, struct xe_ttm_vram_mgr *mgr)
+{
+	struct xe_ttm_offline_resource *pos, *n;
+
+	mutex_lock(&mgr->lock);
+	list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
+		--mgr->n_offlined_pages;
+		drm_buddy_free_list(&mgr->mm, &pos->blocks, 0);
+		mgr->visible_avail += pos->used_visible_size;
+		list_del(&pos->offlined_link);
+		kfree(pos);
+	}
+	list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
+		list_del(&pos->queued_link);
+		mgr->n_queued_pages--;
+		kfree(pos);
+	}
+	mutex_unlock(&mgr->lock);
+}
+
 static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
 {
 	struct xe_device *xe = to_xe_device(dev);
@@ -287,6 +307,8 @@ static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
 	if (ttm_resource_manager_evict_all(&xe->ttm, man))
 		return;
 
+	xe_ttm_vram_free_bad_pages(dev, mgr);
+
 	WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
 
 	drm_buddy_fini(&mgr->mm);
@@ -315,6 +337,8 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
 	man->func = &xe_ttm_vram_mgr_func;
 	mgr->mem_type = mem_type;
 	mutex_init(&mgr->lock);
+	INIT_LIST_HEAD(&mgr->offlined_pages);
+	INIT_LIST_HEAD(&mgr->queued_pages);
 	mgr->default_page_size = default_page_size;
 	mgr->visible_size = io_size;
 	mgr->visible_avail = io_size;
@@ -531,14 +555,162 @@ static struct ttm_buffer_object *xe_ttm_vram_addr_to_tbo(struct drm_buddy *mm, u
 	return NULL;
 }
 
-int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
+static int xe_ttm_vram_reserve_page_at_addr(struct xe_device *xe, unsigned long addr,
+					    struct xe_ttm_vram_mgr *vram_mgr, struct drm_buddy *mm)
 {
-	struct xe_ttm_vram_mgr *vram_mgr = &tile->mem.vram->ttm;
-	struct drm_buddy mm = vram_mgr->mm;
-	struct ttm_buffer_object *tbo;
+	int ret = 0;
+	u64 size = SZ_4K;
+	struct ttm_buffer_object *tbo = NULL;
+	struct xe_ttm_offline_resource *nentry;
 
-	tbo = xe_ttm_vram_addr_to_tbo(&mm, addr);
+	nentry = kzalloc(sizeof(*nentry), GFP_KERNEL);
+	if (!nentry)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&nentry->blocks);
+
+	mutex_lock(&vram_mgr->lock);
+	tbo = xe_ttm_vram_addr_to_tbo(mm, addr);
 
-	return 0;
+	if (tbo) {
+		struct xe_ttm_vram_mgr_resource *pvres;
+		struct ttm_placement place = {};
+		struct ttm_operation_ctx ctx = {
+			.interruptible = false,
+			.gfp_retry_mayfail = false,
+		};
+		bool locked;
+		struct xe_ttm_offline_resource *pos, *n;
+		struct xe_bo *pbo = ttm_to_xe_bo(tbo);
+
+		xe_bo_get(pbo);
+		/* Critical kernel BO? */
+		if (pbo->ttm.type == ttm_bo_type_kernel &&
+		    !(pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM)) {
+			mutex_unlock(&vram_mgr->lock);
+			kfree(nentry);
+			xe_ttm_vram_free_bad_pages(&xe->drm, vram_mgr);
+			xe_bo_put(pbo);
+			drm_warn(&xe->drm,
+				 "%s: corrupt addr: 0x%lx in critical kernel bo, wedge now\n",
+				__func__, addr);
+			/* Wedge the device */
+			xe_device_declare_wedged(xe);
+			return -EIO;
+		}
+		pvres = to_xe_ttm_vram_mgr_resource(pbo->ttm.resource);
+		nentry->id = ++vram_mgr->n_queued_pages;
+		nentry->blocks = pvres->blocks;
+		list_add(&nentry->queued_link, &vram_mgr->queued_pages);
+		mutex_unlock(&vram_mgr->lock);
+
+		/* Purge BO containing address */
+		spin_lock(&pbo->ttm.bdev->lru_lock);
+		locked = dma_resv_trylock(pbo->ttm.base.resv);
+		spin_unlock(&pbo->ttm.bdev->lru_lock);
+		WARN_ON(!locked);
+		ret = ttm_bo_validate(&pbo->ttm, &place, &ctx);
+		drm_WARN_ON(&xe->drm, ret);
+		if (locked)
+			dma_resv_unlock(pbo->ttm.base.resv);
+		xe_bo_put(pbo);
+
+		/* Reserve page at address addr */
+		mutex_lock(&vram_mgr->lock);
+		ret = drm_buddy_alloc_blocks(mm, addr, addr + size,
+					     size, size, &nentry->blocks,
+					     DRM_BUDDY_RANGE_ALLOCATION);
+
+		if (ret) {
+			drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
+				 addr, ret);
+			mutex_unlock(&vram_mgr->lock);
+			return ret;
+		}
+		if ((addr + size) <= vram_mgr->visible_size) {
+			nentry->used_visible_size = size;
+		} else {
+			struct drm_buddy_block *block;
+
+			list_for_each_entry(block, &nentry->blocks, link) {
+				u64 start = drm_buddy_block_offset(block);
+
+				if (start < vram_mgr->visible_size) {
+					u64 end = start + drm_buddy_block_size(mm, block);
+
+					nentry->used_visible_size +=
+						min(end, vram_mgr->visible_size) - start;
+				}
+			}
+		}
+		vram_mgr->visible_avail -= nentry->used_visible_size;
+		list_for_each_entry_safe(pos, n, &vram_mgr->queued_pages, queued_link) {
+			if (pos->id == nentry->id) {
+				--vram_mgr->n_queued_pages;
+				list_del(&pos->queued_link);
+				break;
+			}
+		}
+		list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
+		++vram_mgr->n_offlined_pages;
+		mutex_unlock(&vram_mgr->lock);
+		return ret;
+
+	} else {
+		ret = drm_buddy_alloc_blocks(mm, addr, addr + size,
+					     size, size, &nentry->blocks,
+					     DRM_BUDDY_RANGE_ALLOCATION);
+		if (ret) {
+			drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
+				 addr, ret);
+			kfree(nentry);
+			mutex_unlock(&vram_mgr->lock);
+			return ret;
+		}
+		if ((addr + size) <= vram_mgr->visible_size) {
+			nentry->used_visible_size = size;
+		} else {
+			struct drm_buddy_block *block;
+
+			list_for_each_entry(block, &nentry->blocks, link) {
+				u64 start = drm_buddy_block_offset(block);
+
+				if (start < vram_mgr->visible_size) {
+					u64 end = start + drm_buddy_block_size(mm, block);
+
+					nentry->used_visible_size +=
+						min(end, vram_mgr->visible_size) - start;
+				}
+			}
+		}
+		vram_mgr->visible_avail -= nentry->used_visible_size;
+		nentry->id = ++vram_mgr->n_offlined_pages;
+		list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
+		mutex_unlock(&vram_mgr->lock);
+	}
+	/* Success */
+	return ret;
+}
+
+/**
+ * xe_ttm_tbo_handle_addr_fault - Handle flagged vram physical address error
+ * @tile: pointer to tile where address belongs
+ * @addr: physical faulty address
+ *
+ * Handle the physical faulty address error on a specific tile.
+ *
+ * Returns 0 for success, negative error code otherwise.
+ */
+int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
+{
+	struct xe_ttm_vram_mgr *vram_mgr = &tile->mem.vram->ttm;
+	struct xe_device *xe = tile_to_xe(tile);
+	struct drm_buddy *mm = &vram_mgr->mm;
+	int ret;
+
+	/* Reserve page at address */
+	ret = xe_ttm_vram_reserve_page_at_addr(xe, addr, vram_mgr, mm);
+	if (ret == -EIO)
+		return 0; /* success, wedged by kernel. */
+	return ret;
 }
 EXPORT_SYMBOL(xe_ttm_tbo_handle_addr_fault);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index a71e14818ec2..85511b51af75 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ -19,6 +19,14 @@ struct xe_ttm_vram_mgr {
 	struct ttm_resource_manager manager;
 	/** @mm: DRM buddy allocator which manages the VRAM */
 	struct drm_buddy mm;
+	/** @offlined_pages: List of offlined pages */
+	struct list_head offlined_pages;
+	/** @n_offlined_pages: Number of offlined pages */
+	u16 n_offlined_pages;
+	/** @queued_pages: List of queued pages */
+	struct list_head queued_pages;
+	/** @n_queued_pages: Number of queued pages */
+	u16 n_queued_pages;
 	/** @visible_size: Proped size of the CPU visible portion */
 	u64 visible_size;
 	/** @visible_avail: CPU visible portion still unallocated */
@@ -45,4 +53,17 @@ struct xe_ttm_vram_mgr_resource {
 	unsigned long flags;
 };
 
+struct xe_ttm_offline_resource {
+	/** @offlined_link: Link to offlined pages */
+	struct list_head offlined_link;
+	/** @queued_link: Link to queued pages */
+	struct list_head queued_link;
+	/** @blocks: list of DRM buddy blocks */
+	struct list_head blocks;
+	/** @used_visible_size: How many CPU visible bytes this resource is using */
+	u64 used_visible_size;
+	/** @id: The id of an offline resource */
+	u16 id;
+};
+
 #endif
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 4/6] [DO NOT REVIEW]drm/xe/cri: Add debugfs to inject faulty vram address
  2026-02-13  9:25 [RFC PATCH 0/6] Add memory page offlining support Tejas Upadhyay
                   ` (2 preceding siblings ...)
  2026-02-13  9:25 ` [RFC PATCH 3/6] drm/xe: Handle physical memory address error Tejas Upadhyay
@ 2026-02-13  9:25 ` Tejas Upadhyay
  2026-02-13  9:25 ` [RFC PATCH 5/6] drm/xe: Add routine to dump allocated VRAM blocks Tejas Upadhyay
  2026-02-13  9:25 ` [RFC PATCH 6/6] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
  5 siblings, 0 replies; 13+ messages in thread
From: Tejas Upadhyay @ 2026-02-13  9:25 UTC (permalink / raw)
  To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
	Tejas Upadhyay

Add a debugfs interface which helps in testing the feature with manual
error injection. It allows manual injection of faulty VRAM addresses
into the drm/xe driver, facilitating testing of the CRI memory page
offline feature before real hardware error reporting is available. The
implementation creates a debugfs entry,
/sys/kernel/debug/dri/<bdf>/invalid_addr_vram0,
which accepts a specific faulty address for validation.

For example,
echo 0x1000 > /sys/kernel/debug/dri/<bdf>/invalid_addr_vram0
where 0x1000 is the faulty address being injected.

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_debugfs.c            | 49 ++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h |  2 +
 2 files changed, 51 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
index 844cfafe1ec7..d9dc1acebbce 100644
--- a/drivers/gpu/drm/xe/xe_debugfs.c
+++ b/drivers/gpu/drm/xe/xe_debugfs.c
@@ -27,6 +27,7 @@
 #include "xe_sriov_vf.h"
 #include "xe_step.h"
 #include "xe_tile_debugfs.h"
+#include "xe_ttm_vram_mgr.h"
 #include "xe_vsec.h"
 #include "xe_wa.h"
 
@@ -509,12 +510,48 @@ static const struct file_operations disable_late_binding_fops = {
 	.write = disable_late_binding_set,
 };
 
+static ssize_t addr_fault_reporting_show(struct file *f, char __user *ubuf,
+					 size_t size, loff_t *pos)
+{
+	struct xe_device *xe = file_inode(f)->i_private;
+	char buf[32];
+	int len;
+
+	len = scnprintf(buf, sizeof(buf), "%llu\n", xe->mem.vram->ttm.fault_addr);
+
+	return simple_read_from_buffer(ubuf, size, pos, buf, len);
+}
+
+static ssize_t addr_fault_reporting_set(struct file *f, const char __user *ubuf,
+					size_t size, loff_t *pos)
+{
+	struct xe_device *xe = file_inode(f)->i_private;
+	u64 addr;
+	int ret;
+
+	ret = kstrtou64_from_user(ubuf, size, 0, &addr);
+	if (ret)
+		return ret;
+
+	xe->mem.vram->ttm.fault_addr = addr;
+	xe_ttm_tbo_handle_addr_fault(xe_device_get_root_tile(xe), xe->mem.vram->ttm.fault_addr);
+
+	return size;
+}
+
+static const struct file_operations addr_fault_reporting_fops = {
+	.owner = THIS_MODULE,
+	.read = addr_fault_reporting_show,
+	.write = addr_fault_reporting_set,
+};
+
 void xe_debugfs_register(struct xe_device *xe)
 {
 	struct ttm_device *bdev = &xe->ttm;
 	struct drm_minor *minor = xe->drm.primary;
 	struct dentry *root = minor->debugfs_root;
 	struct ttm_resource_manager *man;
+	u8 mem_type = XE_PL_VRAM1;
 	struct xe_tile *tile;
 	struct xe_gt *gt;
 	u8 tile_id;
@@ -565,6 +602,18 @@ void xe_debugfs_register(struct xe_device *xe)
 	if (man)
 		ttm_resource_manager_create_debugfs(man, root, "stolen_mm");
 
+	do {
+		man = ttm_manager_type(bdev, mem_type);
+		if (man) {
+			char name[20];
+
+			snprintf(name, sizeof(name), "invalid_addr_vram%d", mem_type - XE_PL_VRAM0);
+			debugfs_create_file(name, 0600, root, xe,
+					    &addr_fault_reporting_fops);
+		}
+		--mem_type;
+	} while (mem_type >= XE_PL_VRAM0);
+
 	for_each_tile(tile, xe, tile_id)
 		xe_tile_debugfs_register(tile);
 
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index 85511b51af75..c93573b9aab2 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ -37,6 +37,8 @@ struct xe_ttm_vram_mgr {
 	struct mutex lock;
 	/** @mem_type: The TTM memory type */
 	u32 mem_type;
+	/** @fault_addr: debugfs hook for setting faulty address */
+	u64 fault_addr;
 };
 
 /**
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 5/6] drm/xe: Add routine to dump allocated VRAM blocks
  2026-02-13  9:25 [RFC PATCH 0/6] Add memory page offlining support Tejas Upadhyay
                   ` (3 preceding siblings ...)
  2026-02-13  9:25 ` [RFC PATCH 4/6] [DO NOT REVIEW]drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
@ 2026-02-13  9:25 ` Tejas Upadhyay
  2026-02-13  9:25 ` [RFC PATCH 6/6] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
  5 siblings, 0 replies; 13+ messages in thread
From: Tejas Upadhyay @ 2026-02-13  9:25 UTC (permalink / raw)
  To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
	Tejas Upadhyay

To provide the ability to see allocated blocks under a specific VRAM
instance in the drm/xe driver, a new API is introduced. While the
existing debug output shows the free block list, this addition provides
a comprehensive view of all currently resident VRAM allocations.

The dump will look like:

[  +0.000003] xe 0000:03:00.0: [drm] 0x00000002f8000000-0x00000002f8800000: 8388608
[  +0.000005] xe 0000:03:00.0: [drm] 0x00000002f8800000-0x00000002f8840000: 262144
[  +0.000004] xe 0000:03:00.0: [drm] 0x00000002f8840000-0x00000002f8860000: 131072
[  +0.000004] xe 0000:03:00.0: [drm] 0x00000002f8860000-0x00000002f8870000: 65536
[  +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9000000-0x00000002f9800000: 8388608
[  +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9800000-0x00000002f9880000: 524288
[  +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9880000-0x00000002f9884000: 16384
[  +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9900000-0x00000002f9980000: 524288
[  +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9980000-0x00000002f9988000: 32768
[  +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9988000-0x00000002f998c000: 16384

V2: remove unused code

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 30 ++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  3 +++
 2 files changed, 33 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 82d11348e5ce..cb3394000e83 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -505,6 +505,36 @@ static inline bool contains(u64 s1, u64 e1, u64 s2, u64 e2)
 	return s1 <= s2 && e1 <= e2;
 }
 
+void xe_ttm_vram_dump_allocated_blocks(struct drm_device *dev, struct drm_buddy *mm,
+				       struct drm_printer *p)
+{
+	struct drm_buddy_block *block;
+	LIST_HEAD(dfs);
+	int i;
+
+	for (i = 0; i < mm->n_roots; ++i)
+		list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+	do {
+		block = list_first_entry_or_null(&dfs,
+						 struct drm_buddy_block,
+						 tmp_link);
+		if (!block)
+			break;
+
+		list_del(&block->tmp_link);
+
+		if (drm_buddy_block_is_allocated(block))
+			drm_buddy_block_print(mm, block, p);
+
+		if (drm_buddy_block_is_split(block)) {
+			list_add(&block->right->tmp_link, &dfs);
+			list_add(&block->left->tmp_link, &dfs);
+		}
+	} while (1);
+}
+EXPORT_SYMBOL(xe_ttm_vram_dump_allocated_blocks);
+
 static struct ttm_buffer_object *xe_ttm_vram_addr_to_tbo(struct drm_buddy *mm, u64 start)
 {
 	struct drm_buddy_block *block;
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
index 1d6075411ebf..5872e8b48779 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
@@ -12,6 +12,7 @@ enum dma_data_direction;
 struct xe_device;
 struct xe_tile;
 struct xe_vram_region;
+struct drm_device;
 
 int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
 			   u32 mem_type, u64 size, u64 io_size,
@@ -31,6 +32,8 @@ u64 xe_ttm_vram_get_cpu_visible_size(struct ttm_resource_manager *man);
 void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
 			  u64 *used, u64 *used_visible);
 int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr);
+void xe_ttm_vram_dump_allocated_blocks(struct drm_device *dev, struct drm_buddy *mm,
+				       struct drm_printer *p);
 static inline struct xe_ttm_vram_mgr_resource *
 to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
 {
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH 6/6] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages
  2026-02-13  9:25 [RFC PATCH 0/6] Add memory page offlining support Tejas Upadhyay
                   ` (4 preceding siblings ...)
  2026-02-13  9:25 ` [RFC PATCH 5/6] drm/xe: Add routine to dump allocated VRAM blocks Tejas Upadhyay
@ 2026-02-13  9:25 ` Tejas Upadhyay
  2026-02-18  0:37   ` Rodrigo Vivi
  5 siblings, 1 reply; 13+ messages in thread
From: Tejas Upadhyay @ 2026-02-13  9:25 UTC (permalink / raw)
  To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
	Tejas Upadhyay

Starting with CRI, include a sysfs interface designed to expose
information about bad VRAM pages, i.e. those identified as having
hardware faults (e.g. ECC errors). This interface allows userspace
tools and administrators to monitor the health of the GPU's local
memory and track the status of page retirement. Details on bad gpu
vram pages can be found under /sys/bus/pci/devices/<bdf>/vram_bad_pages.

The format is: pfn : gpu page size : flags

flags:
R: reserved, this gpu page is reserved.
P: pending for reserve, this gpu page is marked as bad and will be reserved in the next page_reserve window.
F: unable to reserve, this gpu page can't be reserved for some reason.

For example, reading cat /sys/bus/pci/devices/<bdf>/vram_bad_pages gives:
0x00000000 : 0x00001000 : R
0x00001234 : 0x00001000 : P

Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
 drivers/gpu/drm/xe/xe_device_sysfs.c |  2 +
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 78 ++++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  1 +
 3 files changed, 81 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_sysfs.c b/drivers/gpu/drm/xe/xe_device_sysfs.c
index a73e0e957cb0..e6a017601428 100644
--- a/drivers/gpu/drm/xe/xe_device_sysfs.c
+++ b/drivers/gpu/drm/xe/xe_device_sysfs.c
@@ -14,6 +14,7 @@
 #include "xe_pcode_api.h"
 #include "xe_pcode.h"
 #include "xe_pm.h"
+#include "xe_ttm_vram_mgr.h"
 
 /**
  * DOC: Xe device sysfs
@@ -284,6 +285,7 @@ int xe_device_sysfs_init(struct xe_device *xe)
 		if (ret)
 			return ret;
 	}
+	xe_ttm_vram_sysfs_init(xe);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index cb3394000e83..c6a81ccaa9d2 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -744,3 +744,81 @@ int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
 	return ret;
 }
 EXPORT_SYMBOL(xe_ttm_tbo_handle_addr_fault);
+
+static void xe_ttm_vram_dump_bad_pages_info(char *buf, struct xe_ttm_vram_mgr *mgr)
+{
+	const unsigned int element_size = sizeof("0xabcdabcd : 0x12345678 : R\n") - 1;
+	struct xe_ttm_offline_resource *pos, *n;
+	struct drm_buddy_block *block;
+	ssize_t s = 0;
+
+	mutex_lock(&mgr->lock);
+	list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
+		block = list_first_entry(&pos->blocks,
+					 struct drm_buddy_block,
+					 link);
+		s += scnprintf(&buf[s], element_size + 1,
+			       "0x%08llx : 0x%08llx : %1s\n",
+			       drm_buddy_block_offset(block) >> PAGE_SHIFT,
+			       drm_buddy_block_size(&mgr->mm, block),
+			       "R");
+	}
+	list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
+		block = list_first_entry(&pos->blocks,
+					 struct drm_buddy_block,
+					 link);
+		s += scnprintf(&buf[s], element_size + 1,
+			       "0x%08llx : 0x%08llx : %1s\n",
+			       drm_buddy_block_offset(block) >> PAGE_SHIFT,
+			       drm_buddy_block_size(&mgr->mm, block),
+			       "P");
+	}
+	mutex_unlock(&mgr->lock);
+}
+
+static ssize_t vram_bad_pages_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct xe_device *xe = pdev_to_xe_device(pdev);
+	struct ttm_resource_manager *man;
+	u8 mem_type = XE_PL_VRAM1;
+
+	buf[0] = '\0';
+	do {
+		man = ttm_manager_type(&xe->ttm, mem_type);
+		if (man)
+			xe_ttm_vram_dump_bad_pages_info(buf + strlen(buf),
+							to_xe_ttm_vram_mgr(man));
+		--mem_type;
+	} while (mem_type >= XE_PL_VRAM0);
+
+	return strlen(buf);
+}
+static DEVICE_ATTR_RO(vram_bad_pages);
+
+static void xe_ttm_vram_sysfs_fini(void *arg)
+{
+	struct xe_device *xe = arg;
+
+	device_remove_file(xe->drm.dev, &dev_attr_vram_bad_pages);
+}
+
+/**
+ * xe_ttm_vram_sysfs_init - Initialize vram sysfs component
+ * @xe: Xe device object
+ *
+ * It needs to be initialized after the main tile component is ready
+ *
+ * Returns: 0 on success, negative error code on error.
+ */
+int xe_ttm_vram_sysfs_init(struct xe_device *xe)
+{
+	int err;
+
+	err = device_create_file(xe->drm.dev, &dev_attr_vram_bad_pages);
+	if (err)
+		return err;
+
+	return devm_add_action_or_reset(xe->drm.dev, xe_ttm_vram_sysfs_fini, xe);
+}
+EXPORT_SYMBOL(xe_ttm_vram_sysfs_init);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
index 5872e8b48779..6e69140c0be8 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
@@ -34,6 +34,7 @@ void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
 int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr);
 void xe_ttm_vram_dump_allocated_blocks(struct drm_device *dev, struct drm_buddy *mm,
 				       struct drm_printer *p);
+int xe_ttm_vram_sysfs_init(struct xe_device *xe);
 static inline struct xe_ttm_vram_mgr_resource *
 to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
 {
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 6/6] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages
  2026-02-13  9:25 ` [RFC PATCH 6/6] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
@ 2026-02-18  0:37   ` Rodrigo Vivi
  2026-02-20 11:18     ` Aravind Iddamsetty
  0 siblings, 1 reply; 13+ messages in thread
From: Rodrigo Vivi @ 2026-02-18  0:37 UTC (permalink / raw)
  To: Tejas Upadhyay, Riana Tauro, Raag Jadav, Aravind Iddamsetty
  Cc: intel-xe, matthew.auld, matthew.brost, himal.prasad.ghimiray

On Fri, Feb 13, 2026 at 02:55:59PM +0530, Tejas Upadhyay wrote:
> Starting CRI, Include a sysfs interface designed to expose information
> about bad VRAM pages—those identified as having hardware faults
> (e.g., ECC errors). This interface allows userspace tools and
> administrators to monitor the health of the GPU's local memory and
> track the status of page retirement.To get details on bad gpu vram
> pages can be found under /sys/bus/pci/devices/bdf/vram_bad_pages.
> 
> Where The format is, pfn : gpu page size : flags
> 
> flags:
> R: reserved, this gpu page is reserved.
> P: pending for reserve, this gpu page is marked as bad, will be reserved in next window of page_reserve.
> F: unable to reserve. this gpu page can’t be reserved due to some reasons.
> 
> For example if you read using cat /sys/bus/pci/devices/bdf/vram_bad_pages,
> 0x00000000 : 0x00001000 : R
> 0x00001234 : 0x00001000 : P

Riana, Raag, Aravind, a good new use case for the drm-ras no?!
Thoughts?

> 
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device_sysfs.c |  2 +
>  drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 78 ++++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  1 +
>  3 files changed, 81 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device_sysfs.c b/drivers/gpu/drm/xe/xe_device_sysfs.c
> index a73e0e957cb0..e6a017601428 100644
> --- a/drivers/gpu/drm/xe/xe_device_sysfs.c
> +++ b/drivers/gpu/drm/xe/xe_device_sysfs.c
> @@ -14,6 +14,7 @@
>  #include "xe_pcode_api.h"
>  #include "xe_pcode.h"
>  #include "xe_pm.h"
> +#include "xe_ttm_vram_mgr.h"
>  
>  /**
>   * DOC: Xe device sysfs
> @@ -284,6 +285,7 @@ int xe_device_sysfs_init(struct xe_device *xe)
>  		if (ret)
>  			return ret;
>  	}
> +	xe_ttm_vram_sysfs_init(xe);
>  
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> index cb3394000e83..c6a81ccaa9d2 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> @@ -744,3 +744,81 @@ int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
>  	return ret;
>  }
>  EXPORT_SYMBOL(xe_ttm_tbo_handle_addr_fault);
> +
> +static void xe_ttm_vram_dump_bad_pages_info(char *buf, struct xe_ttm_vram_mgr *mgr)
> +{
> +	const unsigned int element_size = sizeof("0xabcdabcd : 0x12345678 : R\n") - 1;
> +	struct xe_ttm_offline_resource *pos, *n;
> +	struct drm_buddy_block *block;
> +	ssize_t s = 0;
> +
> +	mutex_lock(&mgr->lock);
> +	list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
> +		block = list_first_entry(&pos->blocks,
> +					 struct drm_buddy_block,
> +					 link);
> +		s += scnprintf(&buf[s], element_size + 1,
> +			       "0x%08llx : 0x%08llx : %1s\n",
> +			       drm_buddy_block_offset(block) >> PAGE_SHIFT,
> +			       drm_buddy_block_size(&mgr->mm, block),
> +			       "R");
> +	}
> +	list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
> +		block = list_first_entry(&pos->blocks,
> +					 struct drm_buddy_block,
> +					 link);
> +		s += scnprintf(&buf[s], element_size + 1,
> +			       "0x%08llx : 0x%08llx : %1s\n",
> +			       drm_buddy_block_offset(block) >> PAGE_SHIFT,
> +			       drm_buddy_block_size(&mgr->mm, block),
> +			       "P");
> +	}
> +	mutex_unlock(&mgr->lock);
> +}
> +
> +static ssize_t vram_bad_pages_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> +	struct ttm_resource_manager *man;
> +	u8 mem_type = XE_PL_VRAM1;
> +
> +	do {
> +		man = ttm_manager_type(&xe->ttm, mem_type);
> +		struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
> +
> +		if (man)
> +			xe_ttm_vram_dump_bad_pages_info(buf, mgr);
> +		--mem_type;
> +	} while (mem_type >= XE_PL_VRAM0);
> +
> +	return sysfs_emit(buf, "%s\n", buf);
> +}
> +static DEVICE_ATTR_RO(vram_bad_pages);
> +
> +static void xe_ttm_vram_sysfs_fini(void *arg)
> +{
> +	struct xe_device *xe = arg;
> +
> +	device_remove_file(xe->drm.dev, &dev_attr_vram_bad_pages);
> +}
> +
> +/**
> + * xe_ttm_vram_sysfs_init - Initialize vram sysfs component
> + * @tile: Xe Tile object
> + *
> + * It needs to be initialized after the main tile component is ready
> + *
> + * Returns: 0 on success, negative error code on error.
> + */
> +int xe_ttm_vram_sysfs_init(struct xe_device *xe)
> +{
> +	int err;
> +
> +	err = device_create_file(xe->drm.dev, &dev_attr_vram_bad_pages);
> +	if (err)
> +		return 0;
> +
> +	return devm_add_action_or_reset(xe->drm.dev, xe_ttm_vram_sysfs_fini, xe);
> +}
> +EXPORT_SYMBOL(xe_ttm_vram_sysfs_init);
> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> index 5872e8b48779..6e69140c0be8 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> @@ -34,6 +34,7 @@ void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
>  int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr);
>  void xe_ttm_vram_dump_allocated_blocks(struct drm_device *dev, struct drm_buddy *mm,
>  				       struct drm_printer *p);
> +int xe_ttm_vram_sysfs_init(struct xe_device *xe);
>  static inline struct xe_ttm_vram_mgr_resource *
>  to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
>  {
> -- 
> 2.52.0
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 6/6] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages
  2026-02-18  0:37   ` Rodrigo Vivi
@ 2026-02-20 11:18     ` Aravind Iddamsetty
  2026-02-20 14:52       ` Vivi, Rodrigo
  0 siblings, 1 reply; 13+ messages in thread
From: Aravind Iddamsetty @ 2026-02-20 11:18 UTC (permalink / raw)
  To: Rodrigo Vivi, Tejas Upadhyay, Riana Tauro, Raag Jadav
  Cc: intel-xe, matthew.auld, matthew.brost, himal.prasad.ghimiray


On 18-02-2026 06:07, Rodrigo Vivi wrote:
> On Fri, Feb 13, 2026 at 02:55:59PM +0530, Tejas Upadhyay wrote:
>> Starting CRI, Include a sysfs interface designed to expose information
>> about bad VRAM pages—those identified as having hardware faults
>> (e.g., ECC errors). This interface allows userspace tools and
>> administrators to monitor the health of the GPU's local memory and
>> track the status of page retirement.To get details on bad gpu vram
>> pages can be found under /sys/bus/pci/devices/bdf/vram_bad_pages.
>>
>> Where The format is, pfn : gpu page size : flags
>>
>> flags:
>> R: reserved, this gpu page is reserved.
>> P: pending for reserve, this gpu page is marked as bad, will be reserved in next window of page_reserve.
>> F: unable to reserve. this gpu page can’t be reserved due to some reasons.
>>
>> For example if you read using cat /sys/bus/pci/devices/bdf/vram_bad_pages,
>> 0x00000000 : 0x00001000 : R
>> 0x00001234 : 0x00001000 : P
> Riana, Raag, Aravind, a good new use case for the drm-ras no?!
> Thoughts?

In general the feature can be supported via the drm-ras framework, but is
the motivation to move all error related info to drm-ras, including any
gpu hang data, health, etc.?

Thanks,
Aravind.
>
>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
>> ---
>>  drivers/gpu/drm/xe/xe_device_sysfs.c |  2 +
>>  drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 78 ++++++++++++++++++++++++++++
>>  drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  1 +
>>  3 files changed, 81 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_device_sysfs.c b/drivers/gpu/drm/xe/xe_device_sysfs.c
>> index a73e0e957cb0..e6a017601428 100644
>> --- a/drivers/gpu/drm/xe/xe_device_sysfs.c
>> +++ b/drivers/gpu/drm/xe/xe_device_sysfs.c
>> @@ -14,6 +14,7 @@
>>  #include "xe_pcode_api.h"
>>  #include "xe_pcode.h"
>>  #include "xe_pm.h"
>> +#include "xe_ttm_vram_mgr.h"
>>  
>>  /**
>>   * DOC: Xe device sysfs
>> @@ -284,6 +285,7 @@ int xe_device_sysfs_init(struct xe_device *xe)
>>  		if (ret)
>>  			return ret;
>>  	}
>> +	xe_ttm_vram_sysfs_init(xe);
>>  
>>  	return 0;
>>  }
>> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>> index cb3394000e83..c6a81ccaa9d2 100644
>> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>> @@ -744,3 +744,81 @@ int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
>>  	return ret;
>>  }
>>  EXPORT_SYMBOL(xe_ttm_tbo_handle_addr_fault);
>> +
>> +static void xe_ttm_vram_dump_bad_pages_info(char *buf, struct xe_ttm_vram_mgr *mgr)
>> +{
>> +	const unsigned int element_size = sizeof("0xabcdabcd : 0x12345678 : R\n") - 1;
>> +	struct xe_ttm_offline_resource *pos, *n;
>> +	struct drm_buddy_block *block;
>> +	ssize_t s = 0;
>> +
>> +	mutex_lock(&mgr->lock);
>> +	list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
>> +		block = list_first_entry(&pos->blocks,
>> +					 struct drm_buddy_block,
>> +					 link);
>> +		s += scnprintf(&buf[s], element_size + 1,
>> +			       "0x%08llx : 0x%08llx : %1s\n",
>> +			       drm_buddy_block_offset(block) >> PAGE_SHIFT,
>> +			       drm_buddy_block_size(&mgr->mm, block),
>> +			       "R");
>> +	}
>> +	list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
>> +		block = list_first_entry(&pos->blocks,
>> +					 struct drm_buddy_block,
>> +					 link);
>> +		s += scnprintf(&buf[s], element_size + 1,
>> +			       "0x%08llx : 0x%08llx : %1s\n",
>> +			       drm_buddy_block_offset(block) >> PAGE_SHIFT,
>> +			       drm_buddy_block_size(&mgr->mm, block),
>> +			       "P");
>> +	}
>> +	mutex_unlock(&mgr->lock);
>> +}
>> +
>> +static ssize_t vram_bad_pages_show(struct device *dev, struct device_attribute *attr, char *buf)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(dev);
>> +	struct xe_device *xe = pdev_to_xe_device(pdev);
>> +	struct ttm_resource_manager *man;
>> +	u8 mem_type = XE_PL_VRAM1;
>> +
>> +	do {
>> +		man = ttm_manager_type(&xe->ttm, mem_type);
>> +		struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
>> +
>> +		if (man)
>> +			xe_ttm_vram_dump_bad_pages_info(buf, mgr);
>> +		--mem_type;
>> +	} while (mem_type >= XE_PL_VRAM0);
>> +
>> +	return sysfs_emit(buf, "%s\n", buf);
>> +}
>> +static DEVICE_ATTR_RO(vram_bad_pages);
>> +
>> +static void xe_ttm_vram_sysfs_fini(void *arg)
>> +{
>> +	struct xe_device *xe = arg;
>> +
>> +	device_remove_file(xe->drm.dev, &dev_attr_vram_bad_pages);
>> +}
>> +
>> +/**
>> + * xe_ttm_vram_sysfs_init - Initialize vram sysfs component
>> + * @tile: Xe Tile object
>> + *
>> + * It needs to be initialized after the main tile component is ready
>> + *
>> + * Returns: 0 on success, negative error code on error.
>> + */
>> +int xe_ttm_vram_sysfs_init(struct xe_device *xe)
>> +{
>> +	int err;
>> +
>> +	err = device_create_file(xe->drm.dev, &dev_attr_vram_bad_pages);
>> +	if (err)
>> +		return 0;
>> +
>> +	return devm_add_action_or_reset(xe->drm.dev, xe_ttm_vram_sysfs_fini, xe);
>> +}
>> +EXPORT_SYMBOL(xe_ttm_vram_sysfs_init);
>> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
>> index 5872e8b48779..6e69140c0be8 100644
>> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
>> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
>> @@ -34,6 +34,7 @@ void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
>>  int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr);
>>  void xe_ttm_vram_dump_allocated_blocks(struct drm_device *dev, struct drm_buddy *mm,
>>  				       struct drm_printer *p);
>> +int xe_ttm_vram_sysfs_init(struct xe_device *xe);
>>  static inline struct xe_ttm_vram_mgr_resource *
>>  to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
>>  {
>> -- 
>> 2.52.0
>>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 6/6] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages
  2026-02-20 11:18     ` Aravind Iddamsetty
@ 2026-02-20 14:52       ` Vivi, Rodrigo
  2026-02-22  5:32         ` Aravind Iddamsetty
  0 siblings, 1 reply; 13+ messages in thread
From: Vivi, Rodrigo @ 2026-02-20 14:52 UTC (permalink / raw)
  To: Upadhyay, Tejas, Tauro, Riana, aravind.iddamsetty@linux.intel.com,
	Jadav, Raag
  Cc: intel-xe@lists.freedesktop.org, Brost,  Matthew,
	Ghimiray, Himal Prasad, Auld, Matthew

On Fri, 2026-02-20 at 16:48 +0530, Aravind Iddamsetty wrote:
> 
> On 18-02-2026 06:07, Rodrigo Vivi wrote:
> > On Fri, Feb 13, 2026 at 02:55:59PM +0530, Tejas Upadhyay wrote:
> > > Starting CRI, Include a sysfs interface designed to expose
> > > information
> > > about bad VRAM pages—those identified as having hardware faults
> > > (e.g., ECC errors). This interface allows userspace tools and
> > > administrators to monitor the health of the GPU's local memory
> > > and
> > > track the status of page retirement.To get details on bad gpu
> > > vram
> > > pages can be found under /sys/bus/pci/devices/bdf/vram_bad_pages.
> > > 
> > > Where The format is, pfn : gpu page size : flags
> > > 
> > > flags:
> > > R: reserved, this gpu page is reserved.
> > > P: pending for reserve, this gpu page is marked as bad, will be
> > > reserved in next window of page_reserve.
> > > F: unable to reserve. this gpu page can’t be reserved due to some
> > > reasons.
> > > 
> > > For example if you read using cat
> > > /sys/bus/pci/devices/bdf/vram_bad_pages,
> > > 0x00000000 : 0x00001000 : R
> > > 0x00001234 : 0x00001000 : P
> > Riana, Raag, Aravind, a good new use case for the drm-ras no?!
> > Thoughts?
> 
> In general the feature can be supported via drm-ras framework, but is
> the motivation to move all error related info to drm-ras, also any
> gpu
> hang data, health etc..,

No, the motivation is not to move everything there.
I was thinking along the lines of avoiding sysfs. But well, it is only
one sysfs and I don't believe we would need any notification upon
new pages marked as bad, or do we?

If we need notifications then sysfs is bad choice and netlink could
provide info and notification in a single place.

> 
> Thanks,
> Aravind.
> > 
> > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_device_sysfs.c |  2 +
> > >  drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 78
> > > ++++++++++++++++++++++++++++
> > >  drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  1 +
> > >  3 files changed, 81 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_device_sysfs.c
> > > b/drivers/gpu/drm/xe/xe_device_sysfs.c
> > > index a73e0e957cb0..e6a017601428 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_sysfs.c
> > > +++ b/drivers/gpu/drm/xe/xe_device_sysfs.c
> > > @@ -14,6 +14,7 @@
> > >  #include "xe_pcode_api.h"
> > >  #include "xe_pcode.h"
> > >  #include "xe_pm.h"
> > > +#include "xe_ttm_vram_mgr.h"
> > >  
> > >  /**
> > >   * DOC: Xe device sysfs
> > > @@ -284,6 +285,7 @@ int xe_device_sysfs_init(struct xe_device
> > > *xe)
> > >  		if (ret)
> > >  			return ret;
> > >  	}
> > > +	xe_ttm_vram_sysfs_init(xe);
> > >  
> > >  	return 0;
> > >  }
> > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > index cb3394000e83..c6a81ccaa9d2 100644
> > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > @@ -744,3 +744,81 @@ int xe_ttm_tbo_handle_addr_fault(struct
> > > xe_tile *tile, unsigned long addr)
> > >  	return ret;
> > >  }
> > >  EXPORT_SYMBOL(xe_ttm_tbo_handle_addr_fault);
> > > +
> > > +static void xe_ttm_vram_dump_bad_pages_info(char *buf, struct
> > > xe_ttm_vram_mgr *mgr)
> > > +{
> > > +	const unsigned int element_size = sizeof("0xabcdabcd :
> > > 0x12345678 : R\n") - 1;
> > > +	struct xe_ttm_offline_resource *pos, *n;
> > > +	struct drm_buddy_block *block;
> > > +	ssize_t s = 0;
> > > +
> > > +	mutex_lock(&mgr->lock);
> > > +	list_for_each_entry_safe(pos, n, &mgr->offlined_pages,
> > > offlined_link) {
> > > +		block = list_first_entry(&pos->blocks,
> > > +					 struct drm_buddy_block,
> > > +					 link);
> > > +		s += scnprintf(&buf[s], element_size + 1,
> > > +			       "0x%08llx : 0x%08llx : %1s\n",
> > > +			       drm_buddy_block_offset(block) >>
> > > PAGE_SHIFT,
> > > +			       drm_buddy_block_size(&mgr->mm,
> > > block),
> > > +			       "R");
> > > +	}
> > > +	list_for_each_entry_safe(pos, n, &mgr->queued_pages,
> > > queued_link) {
> > > +		block = list_first_entry(&pos->blocks,
> > > +					 struct drm_buddy_block,
> > > +					 link);
> > > +		s += scnprintf(&buf[s], element_size + 1,
> > > +			       "0x%08llx : 0x%08llx : %1s\n",
> > > +			       drm_buddy_block_offset(block) >>
> > > PAGE_SHIFT,
> > > +			       drm_buddy_block_size(&mgr->mm,
> > > block),
> > > +			       "P");
> > > +	}
> > > +	mutex_unlock(&mgr->lock);
> > > +}
> > > +
> > > +static ssize_t vram_bad_pages_show(struct device *dev, struct
> > > device_attribute *attr, char *buf)
> > > +{
> > > +	struct pci_dev *pdev = to_pci_dev(dev);
> > > +	struct xe_device *xe = pdev_to_xe_device(pdev);
> > > +	struct ttm_resource_manager *man;
> > > +	u8 mem_type = XE_PL_VRAM1;
> > > +
> > > +	do {
> > > +		man = ttm_manager_type(&xe->ttm, mem_type);
> > > +		struct xe_ttm_vram_mgr *mgr =
> > > to_xe_ttm_vram_mgr(man);
> > > +
> > > +		if (man)
> > > +			xe_ttm_vram_dump_bad_pages_info(buf,
> > > mgr);
> > > +		--mem_type;
> > > +	} while (mem_type >= XE_PL_VRAM0);
> > > +
> > > +	return sysfs_emit(buf, "%s\n", buf);
> > > +}
> > > +static DEVICE_ATTR_RO(vram_bad_pages);
> > > +
> > > +static void xe_ttm_vram_sysfs_fini(void *arg)
> > > +{
> > > +	struct xe_device *xe = arg;
> > > +
> > > +	device_remove_file(xe->drm.dev,
> > > &dev_attr_vram_bad_pages);
> > > +}
> > > +
> > > +/**
> > > + * xe_ttm_vram_sysfs_init - Initialize vram sysfs component
> > > + * @tile: Xe Tile object
> > > + *
> > > + * It needs to be initialized after the main tile component is
> > > ready
> > > + *
> > > + * Returns: 0 on success, negative error code on error.
> > > + */
> > > +int xe_ttm_vram_sysfs_init(struct xe_device *xe)
> > > +{
> > > +	int err;
> > > +
> > > +	err = device_create_file(xe->drm.dev,
> > > &dev_attr_vram_bad_pages);
> > > +	if (err)
> > > +		return 0;
> > > +
> > > +	return devm_add_action_or_reset(xe->drm.dev,
> > > xe_ttm_vram_sysfs_fini, xe);
> > > +}
> > > +EXPORT_SYMBOL(xe_ttm_vram_sysfs_init);
> > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > index 5872e8b48779..6e69140c0be8 100644
> > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > @@ -34,6 +34,7 @@ void xe_ttm_vram_get_used(struct
> > > ttm_resource_manager *man,
> > >  int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned
> > > long addr);
> > >  void xe_ttm_vram_dump_allocated_blocks(struct drm_device *dev,
> > > struct drm_buddy *mm,
> > >  				       struct drm_printer *p);
> > > +int xe_ttm_vram_sysfs_init(struct xe_device *xe);
> > >  static inline struct xe_ttm_vram_mgr_resource *
> > >  to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
> > >  {
> > > -- 
> > > 2.52.0
> > > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 6/6] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages
  2026-02-20 14:52       ` Vivi, Rodrigo
@ 2026-02-22  5:32         ` Aravind Iddamsetty
  2026-02-23 21:26           ` Rodrigo Vivi
  0 siblings, 1 reply; 13+ messages in thread
From: Aravind Iddamsetty @ 2026-02-22  5:32 UTC (permalink / raw)
  To: Vivi, Rodrigo, Upadhyay, Tejas, Tauro, Riana, Jadav, Raag
  Cc: intel-xe@lists.freedesktop.org, Brost, Matthew,
	Ghimiray, Himal Prasad, Auld, Matthew


On 20-02-2026 20:22, Vivi, Rodrigo wrote:
> On Fri, 2026-02-20 at 16:48 +0530, Aravind Iddamsetty wrote:
>> On 18-02-2026 06:07, Rodrigo Vivi wrote:
>>> On Fri, Feb 13, 2026 at 02:55:59PM +0530, Tejas Upadhyay wrote:
>>>> Starting CRI, Include a sysfs interface designed to expose
>>>> information
>>>> about bad VRAM pages—those identified as having hardware faults
>>>> (e.g., ECC errors). This interface allows userspace tools and
>>>> administrators to monitor the health of the GPU's local memory
>>>> and
>>>> track the status of page retirement.To get details on bad gpu
>>>> vram
>>>> pages can be found under /sys/bus/pci/devices/bdf/vram_bad_pages.
>>>>
>>>> Where The format is, pfn : gpu page size : flags
>>>>
>>>> flags:
>>>> R: reserved, this gpu page is reserved.
>>>> P: pending for reserve, this gpu page is marked as bad, will be
>>>> reserved in next window of page_reserve.
>>>> F: unable to reserve. this gpu page can’t be reserved due to some
>>>> reasons.
>>>>
>>>> For example if you read using cat
>>>> /sys/bus/pci/devices/bdf/vram_bad_pages,
>>>> 0x00000000 : 0x00001000 : R
>>>> 0x00001234 : 0x00001000 : P
>>> Riana, Raag, Aravind, a good new use case for the drm-ras no?!
>>> Thoughts?
>> In general the feature can be supported via drm-ras framework, but is
>> the motivation to move all error related info to drm-ras, also any
>> gpu
>> hang data, health etc..,
> No, the motivation is not to move everything there.
> I was thinking on the lines of avoiding sysfs. But well, it is only
> one sysfs and I don't believe we would need any notification upon
> new pages marked as bad, or do we?

I do not see a requirement to have any additional notification, as there
would already be a notification for an uncorrectable error.

Thanks,
Aravind.
>
> If we need notifications then sysfs is a bad choice, and netlink could
> provide info and notifications in a single place.
>
>> Thanks,
>> Aravind.
>>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
>>>> ---
>>>>  drivers/gpu/drm/xe/xe_device_sysfs.c |  2 +
>>>>  drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 78 ++++++++++++++++++++++++++++
>>>>  drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  1 +
>>>>  3 files changed, 81 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_device_sysfs.c b/drivers/gpu/drm/xe/xe_device_sysfs.c
>>>> index a73e0e957cb0..e6a017601428 100644
>>>> --- a/drivers/gpu/drm/xe/xe_device_sysfs.c
>>>> +++ b/drivers/gpu/drm/xe/xe_device_sysfs.c
>>>> @@ -14,6 +14,7 @@
>>>>  #include "xe_pcode_api.h"
>>>>  #include "xe_pcode.h"
>>>>  #include "xe_pm.h"
>>>> +#include "xe_ttm_vram_mgr.h"
>>>>  
>>>>  /**
>>>>   * DOC: Xe device sysfs
>>>> @@ -284,6 +285,7 @@ int xe_device_sysfs_init(struct xe_device *xe)
>>>>  		if (ret)
>>>>  			return ret;
>>>>  	}
>>>> +	xe_ttm_vram_sysfs_init(xe);
>>>>  
>>>>  	return 0;
>>>>  }
>>>> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>>>> index cb3394000e83..c6a81ccaa9d2 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>>>> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
>>>> @@ -744,3 +744,81 @@ int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
>>>>  	return ret;
>>>>  }
>>>>  EXPORT_SYMBOL(xe_ttm_tbo_handle_addr_fault);
>>>> +
>>>> +static ssize_t xe_ttm_vram_dump_bad_pages_info(char *buf, struct xe_ttm_vram_mgr *mgr)
>>>> +{
>>>> +	/* length of one formatted "pfn : size : flag" line, excluding the NUL */
>>>> +	const unsigned int element_size = sizeof("0xabcdabcd : 0x12345678 : R\n") - 1;
>>>> +	struct xe_ttm_offline_resource *pos, *n;
>>>> +	struct drm_buddy_block *block;
>>>> +	ssize_t s = 0;
>>>> +
>>>> +	mutex_lock(&mgr->lock);
>>>> +	list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
>>>> +		block = list_first_entry(&pos->blocks,
>>>> +					 struct drm_buddy_block,
>>>> +					 link);
>>>> +		s += scnprintf(&buf[s], element_size + 1,
>>>> +			       "0x%08llx : 0x%08llx : %1s\n",
>>>> +			       drm_buddy_block_offset(block) >> PAGE_SHIFT,
>>>> +			       drm_buddy_block_size(&mgr->mm, block),
>>>> +			       "R");
>>>> +	}
>>>> +	list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
>>>> +		block = list_first_entry(&pos->blocks,
>>>> +					 struct drm_buddy_block,
>>>> +					 link);
>>>> +		s += scnprintf(&buf[s], element_size + 1,
>>>> +			       "0x%08llx : 0x%08llx : %1s\n",
>>>> +			       drm_buddy_block_offset(block) >> PAGE_SHIFT,
>>>> +			       drm_buddy_block_size(&mgr->mm, block),
>>>> +			       "P");
>>>> +	}
>>>> +	mutex_unlock(&mgr->lock);
>>>> +
>>>> +	return s;
>>>> +}
>>>> +
>>>> +static ssize_t vram_bad_pages_show(struct device *dev, struct device_attribute *attr, char *buf)
>>>> +{
>>>> +	struct pci_dev *pdev = to_pci_dev(dev);
>>>> +	struct xe_device *xe = pdev_to_xe_device(pdev);
>>>> +	struct ttm_resource_manager *man;
>>>> +	u8 mem_type = XE_PL_VRAM1;
>>>> +	ssize_t s = 0;
>>>> +
>>>> +	do {
>>>> +		man = ttm_manager_type(&xe->ttm, mem_type);
>>>> +		if (man)
>>>> +			s += xe_ttm_vram_dump_bad_pages_info(&buf[s],
>>>> +							     to_xe_ttm_vram_mgr(man));
>>>> +		--mem_type;
>>>> +	} while (mem_type >= XE_PL_VRAM0);
>>>> +
>>>> +	return s;
>>>> +}
>>>> +static DEVICE_ATTR_RO(vram_bad_pages);
>>>> +
>>>> +static void xe_ttm_vram_sysfs_fini(void *arg)
>>>> +{
>>>> +	struct xe_device *xe = arg;
>>>> +
>>>> +	device_remove_file(xe->drm.dev, &dev_attr_vram_bad_pages);
>>>> +}
>>>> +
>>>> +/**
>>>> + * xe_ttm_vram_sysfs_init - Initialize vram sysfs component
>>>> + * @xe: Xe device object
>>>> + *
>>>> + * It needs to be initialized after the main tile component is ready
>>>> + *
>>>> + * Returns: 0 on success, negative error code on error.
>>>> + */
>>>> +int xe_ttm_vram_sysfs_init(struct xe_device *xe)
>>>> +{
>>>> +	int err;
>>>> +
>>>> +	err = device_create_file(xe->drm.dev, &dev_attr_vram_bad_pages);
>>>> +	if (err)
>>>> +		return err;
>>>> +
>>>> +	return devm_add_action_or_reset(xe->drm.dev, xe_ttm_vram_sysfs_fini, xe);
>>>> +}
>>>> +EXPORT_SYMBOL(xe_ttm_vram_sysfs_init);
>>>> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
>>>> index 5872e8b48779..6e69140c0be8 100644
>>>> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
>>>> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
>>>> @@ -34,6 +34,7 @@ void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
>>>>  int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr);
>>>>  void xe_ttm_vram_dump_allocated_blocks(struct drm_device *dev, struct drm_buddy *mm,
>>>>  				       struct drm_printer *p);
>>>> +int xe_ttm_vram_sysfs_init(struct xe_device *xe);
>>>>  static inline struct xe_ttm_vram_mgr_resource *
>>>>  to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
>>>>  {
>>>> -- 
>>>> 2.52.0
>>>>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 6/6] [DO NOT REVIEW]]drm/xe/cri: Add sysfs interface for bad gpu vram pages
  2026-02-22  5:32         ` Aravind Iddamsetty
@ 2026-02-23 21:26           ` Rodrigo Vivi
  0 siblings, 0 replies; 13+ messages in thread
From: Rodrigo Vivi @ 2026-02-23 21:26 UTC (permalink / raw)
  To: Aravind Iddamsetty
  Cc: Upadhyay, Tejas, Tauro, Riana, Jadav, Raag,
	intel-xe@lists.freedesktop.org, Brost, Matthew,
	Ghimiray, Himal Prasad, Auld, Matthew

On Sun, Feb 22, 2026 at 11:02:25AM +0530, Aravind Iddamsetty wrote:
> 
> On 20-02-2026 20:22, Vivi, Rodrigo wrote:
> > On Fri, 2026-02-20 at 16:48 +0530, Aravind Iddamsetty wrote:
> >> On 18-02-2026 06:07, Rodrigo Vivi wrote:
> >>> On Fri, Feb 13, 2026 at 02:55:59PM +0530, Tejas Upadhyay wrote:
> >>>> Starting with CRI, include a sysfs interface designed to expose
> >>>> information about bad VRAM pages—those identified as having
> >>>> hardware faults (e.g., ECC errors). This interface allows userspace
> >>>> tools and administrators to monitor the health of the GPU's local
> >>>> memory and track the status of page retirement. Details on bad GPU
> >>>> VRAM pages can be found under
> >>>> /sys/bus/pci/devices/<bdf>/vram_bad_pages.
> >>>>
> >>>> The format is: pfn : gpu page size : flags
> >>>>
> >>>> flags:
> >>>> R: reserved, this gpu page is reserved.
> >>>> P: pending reserve, this gpu page is marked as bad and will be
> >>>> reserved in the next page_reserve window.
> >>>> F: failed to reserve, this gpu page can’t be reserved for some
> >>>> reason.
> >>>>
> >>>> For example, reading with
> >>>> cat /sys/bus/pci/devices/<bdf>/vram_bad_pages gives:
> >>>> 0x00000000 : 0x00001000 : R
> >>>> 0x00001234 : 0x00001000 : P
> >>> Riana, Raag, Aravind, a good new use case for the drm-ras no?!
> >>> Thoughts?
> >> In general the feature can be supported via the drm-ras framework, but
> >> is the motivation to move all error-related info to drm-ras, including
> >> any gpu hang data, health, etc.?
> > No, the motivation is not to move everything there.
> > I was thinking along the lines of avoiding sysfs. But well, it is only
> > one sysfs and I don't believe we would need any notification upon
> > new pages marked as bad, or do we?
> 
> I do not see a requirement to have any additional notification, as there
> would already be a notification for an uncorrectable error.

Fair enough then

Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

on continuing with this sysfs entry for this case.

> 
> Thanks,
> Aravind.
> >
> > If we need notifications then sysfs is a bad choice, and netlink could
> > provide info and notifications in a single place.
> >
> >> Thanks,
> >> Aravind.
> >>>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> >>>> ---
> >>>>  drivers/gpu/drm/xe/xe_device_sysfs.c |  2 +
> >>>>  drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 78 ++++++++++++++++++++++++++++
> >>>>  drivers/gpu/drm/xe/xe_ttm_vram_mgr.h |  1 +
> >>>>  3 files changed, 81 insertions(+)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/xe/xe_device_sysfs.c b/drivers/gpu/drm/xe/xe_device_sysfs.c
> >>>> index a73e0e957cb0..e6a017601428 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_device_sysfs.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_device_sysfs.c
> >>>> @@ -14,6 +14,7 @@
> >>>>  #include "xe_pcode_api.h"
> >>>>  #include "xe_pcode.h"
> >>>>  #include "xe_pm.h"
> >>>> +#include "xe_ttm_vram_mgr.h"
> >>>>  
> >>>>  /**
> >>>>   * DOC: Xe device sysfs
> >>>> @@ -284,6 +285,7 @@ int xe_device_sysfs_init(struct xe_device *xe)
> >>>>  		if (ret)
> >>>>  			return ret;
> >>>>  	}
> >>>> +	xe_ttm_vram_sysfs_init(xe);
> >>>>  
> >>>>  	return 0;
> >>>>  }
> >>>> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> >>>> index cb3394000e83..c6a81ccaa9d2 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> >>>> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> >>>> @@ -744,3 +744,81 @@ int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
> >>>>  	return ret;
> >>>>  }
> >>>>  EXPORT_SYMBOL(xe_ttm_tbo_handle_addr_fault);
> >>>> +
> >>>> +static ssize_t xe_ttm_vram_dump_bad_pages_info(char *buf, struct xe_ttm_vram_mgr *mgr)
> >>>> +{
> >>>> +	/* length of one formatted "pfn : size : flag" line, excluding the NUL */
> >>>> +	const unsigned int element_size = sizeof("0xabcdabcd : 0x12345678 : R\n") - 1;
> >>>> +	struct xe_ttm_offline_resource *pos, *n;
> >>>> +	struct drm_buddy_block *block;
> >>>> +	ssize_t s = 0;
> >>>> +
> >>>> +	mutex_lock(&mgr->lock);
> >>>> +	list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
> >>>> +		block = list_first_entry(&pos->blocks,
> >>>> +					 struct drm_buddy_block,
> >>>> +					 link);
> >>>> +		s += scnprintf(&buf[s], element_size + 1,
> >>>> +			       "0x%08llx : 0x%08llx : %1s\n",
> >>>> +			       drm_buddy_block_offset(block) >> PAGE_SHIFT,
> >>>> +			       drm_buddy_block_size(&mgr->mm, block),
> >>>> +			       "R");
> >>>> +	}
> >>>> +	list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
> >>>> +		block = list_first_entry(&pos->blocks,
> >>>> +					 struct drm_buddy_block,
> >>>> +					 link);
> >>>> +		s += scnprintf(&buf[s], element_size + 1,
> >>>> +			       "0x%08llx : 0x%08llx : %1s\n",
> >>>> +			       drm_buddy_block_offset(block) >> PAGE_SHIFT,
> >>>> +			       drm_buddy_block_size(&mgr->mm, block),
> >>>> +			       "P");
> >>>> +	}
> >>>> +	mutex_unlock(&mgr->lock);
> >>>> +
> >>>> +	return s;
> >>>> +}
> >>>> +
> >>>> +static ssize_t vram_bad_pages_show(struct device *dev, struct device_attribute *attr, char *buf)
> >>>> +{
> >>>> +	struct pci_dev *pdev = to_pci_dev(dev);
> >>>> +	struct xe_device *xe = pdev_to_xe_device(pdev);
> >>>> +	struct ttm_resource_manager *man;
> >>>> +	u8 mem_type = XE_PL_VRAM1;
> >>>> +	ssize_t s = 0;
> >>>> +
> >>>> +	do {
> >>>> +		man = ttm_manager_type(&xe->ttm, mem_type);
> >>>> +		if (man)
> >>>> +			s += xe_ttm_vram_dump_bad_pages_info(&buf[s],
> >>>> +							     to_xe_ttm_vram_mgr(man));
> >>>> +		--mem_type;
> >>>> +	} while (mem_type >= XE_PL_VRAM0);
> >>>> +
> >>>> +	return s;
> >>>> +}
> >>>> +static DEVICE_ATTR_RO(vram_bad_pages);
> >>>> +
> >>>> +static void xe_ttm_vram_sysfs_fini(void *arg)
> >>>> +{
> >>>> +	struct xe_device *xe = arg;
> >>>> +
> >>>> +	device_remove_file(xe->drm.dev, &dev_attr_vram_bad_pages);
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * xe_ttm_vram_sysfs_init - Initialize vram sysfs component
> >>>> + * @xe: Xe device object
> >>>> + *
> >>>> + * It needs to be initialized after the main tile component is ready
> >>>> + *
> >>>> + * Returns: 0 on success, negative error code on error.
> >>>> + */
> >>>> +int xe_ttm_vram_sysfs_init(struct xe_device *xe)
> >>>> +{
> >>>> +	int err;
> >>>> +
> >>>> +	err = device_create_file(xe->drm.dev, &dev_attr_vram_bad_pages);
> >>>> +	if (err)
> >>>> +		return err;
> >>>> +
> >>>> +	return devm_add_action_or_reset(xe->drm.dev, xe_ttm_vram_sysfs_fini, xe);
> >>>> +}
> >>>> +EXPORT_SYMBOL(xe_ttm_vram_sysfs_init);
> >>>> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> >>>> index 5872e8b48779..6e69140c0be8 100644
> >>>> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> >>>> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> >>>> @@ -34,6 +34,7 @@ void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
> >>>>  int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr);
> >>>>  void xe_ttm_vram_dump_allocated_blocks(struct drm_device *dev, struct drm_buddy *mm,
> >>>>  				       struct drm_printer *p);
> >>>> +int xe_ttm_vram_sysfs_init(struct xe_device *xe);
> >>>>  static inline struct xe_ttm_vram_mgr_resource *
> >>>>  to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
> >>>>  {
> >>>> -- 
> >>>> 2.52.0
> >>>>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [RFC PATCH 1/6] drm/xe/svm: Use res_to_mem_region
  2026-02-13  9:25 ` [RFC PATCH 1/6] drm/xe/svm: Use res_to_mem_region Tejas Upadhyay
@ 2026-02-24  2:16   ` Matthew Brost
  0 siblings, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2026-02-24  2:16 UTC (permalink / raw)
  To: Tejas Upadhyay; +Cc: intel-xe, matthew.auld, himal.prasad.ghimiray

On Fri, Feb 13, 2026 at 02:55:54PM +0530, Tejas Upadhyay wrote:
> Replace the direct use of block->private with the helper function
> res_to_mem_region to get vram region.
> 
> V2(MattA): Use res_to_mem_region
> 
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_bo.c  | 2 +-
>  drivers/gpu/drm/xe/xe_bo.h  | 1 +
>  drivers/gpu/drm/xe/xe_svm.c | 8 +-------
>  3 files changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> index cb8a177ec02b..70aca621c1a1 100644
> --- a/drivers/gpu/drm/xe/xe_bo.c
> +++ b/drivers/gpu/drm/xe/xe_bo.c
> @@ -173,7 +173,7 @@ mem_type_to_migrate(struct xe_device *xe, u32 mem_type)
>  	return tile->migrate;
>  }
>  
> -static struct xe_vram_region *res_to_mem_region(struct ttm_resource *res)
> +struct xe_vram_region *res_to_mem_region(struct ttm_resource *res)

We need kernel-doc now, and perhaps a better name for a public function too.
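
A rough sketch of what that could look like (xe_res_to_vram_region is
just a naming suggestion, nothing settled):

/**
 * xe_res_to_vram_region() - Get the VRAM region backing a TTM resource
 * @res: the &struct ttm_resource, expected to be a VRAM placement
 *
 * Resolves the resource's manager to the &struct xe_vram_region the
 * resource was allocated from.
 *
 * Return: Pointer to the backing &struct xe_vram_region.
 */
struct xe_vram_region *xe_res_to_vram_region(struct ttm_resource *res);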

>  {
>  	struct xe_device *xe = ttm_to_xe_device(res->bo->bdev);
>  	struct ttm_resource_manager *mgr;
> diff --git a/drivers/gpu/drm/xe/xe_bo.h b/drivers/gpu/drm/xe/xe_bo.h
> index c914ab719f20..393f1b4faf99 100644
> --- a/drivers/gpu/drm/xe/xe_bo.h
> +++ b/drivers/gpu/drm/xe/xe_bo.h
> @@ -311,6 +311,7 @@ int xe_bo_dumb_create(struct drm_file *file_priv,
>  		      struct drm_mode_create_dumb *args);
>  
>  bool xe_bo_needs_ccs_pages(struct xe_bo *bo);
> +struct xe_vram_region *res_to_mem_region(struct ttm_resource *res);
>  
>  static inline size_t xe_bo_ccs_pages_start(struct xe_bo *bo)
>  {
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 213f0334518a..8015eb6fcbc9 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -762,7 +762,7 @@ static int xe_svm_populate_devmem_pfn(struct drm_pagemap_devmem *devmem_allocati
>  	int j = 0;
>  
>  	list_for_each_entry(block, blocks, link) {
> -		struct xe_vram_region *vr = block->private;
> +		struct xe_vram_region *vr = res_to_mem_region(res);
>  		struct drm_buddy *buddy = vram_to_buddy(vr);

vr and buddy can now be declared outside of the loop.
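
i.e., a rough, untested sketch:

	struct xe_vram_region *vr = res_to_mem_region(res);
	struct drm_buddy *buddy = vram_to_buddy(vr);

	list_for_each_entry(block, blocks, link) {
		u64 block_pfn = block_offset_to_pfn(devmem_allocation->dpagemap,
						    drm_buddy_block_offset(block));
		...
	}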

Matt

>  		u64 block_pfn = block_offset_to_pfn(devmem_allocation->dpagemap,
>  						    drm_buddy_block_offset(block));
> @@ -1033,9 +1033,7 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
>  	struct dma_fence *pre_migrate_fence = NULL;
>  	struct xe_device *xe = vr->xe;
>  	struct device *dev = xe->drm.dev;
> -	struct drm_buddy_block *block;
>  	struct xe_validation_ctx vctx;
> -	struct list_head *blocks;
>  	struct drm_exec exec;
>  	struct xe_bo *bo;
>  	int err = 0, idx;
> @@ -1072,10 +1070,6 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
>  					&dpagemap_devmem_ops, dpagemap, end - start,
>  					pre_migrate_fence);
>  
> -		blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
> -		list_for_each_entry(block, blocks, link)
> -			block->private = vr;
> -
>  		xe_bo_get(bo);
>  
>  		/* Ensure the device has a pm ref while there are device pages active. */
> -- 
> 2.52.0
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-02-24  2:17 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-13  9:25 [RFC PATCH 0/6] Add memory page offlining support Tejas Upadhyay
2026-02-13  9:25 ` [RFC PATCH 1/6] drm/xe/svm: Use res_to_mem_region Tejas Upadhyay
2026-02-24  2:16   ` Matthew Brost
2026-02-13  9:25 ` [RFC PATCH 2/6] drm/xe: Implement VRAM object tracking ability using physical address Tejas Upadhyay
2026-02-13  9:25 ` [RFC PATCH 3/6] drm/xe: Handle physical memory address error Tejas Upadhyay
2026-02-13  9:25 ` [RFC PATCH 4/6] [DO NOT REVIEW]drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
2026-02-13  9:25 ` [RFC PATCH 5/6] drm/xe: Add routine to dump allocated VRAM blocks Tejas Upadhyay
2026-02-13  9:25 ` [RFC PATCH 6/6] [DO NOT REVIEW]]drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
2026-02-18  0:37   ` Rodrigo Vivi
2026-02-20 11:18     ` Aravind Iddamsetty
2026-02-20 14:52       ` Vivi, Rodrigo
2026-02-22  5:32         ` Aravind Iddamsetty
2026-02-23 21:26           ` Rodrigo Vivi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox