* [RFC PATCH 0/7] Add memory page offlining support
@ 2026-02-12 16:34 Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 1/7] drm/xe: Add a helper to get vram region from physical address Tejas Upadhyay
` (6 more replies)
0 siblings, 7 replies; 10+ messages in thread
From: Tejas Upadhyay @ 2026-02-12 16:34 UTC (permalink / raw)
To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
Tejas Upadhyay
This functionality represents a significant step in making
the xe driver gracefully handle hardware memory degradation.
By integrating with the DRM Buddy allocator, the driver
can permanently "carve out" faulty memory so it isn't reused
by subsequent allocations.
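The carve-out idea can be modelled in plain userspace C. This is a toy sketch, not xe code: a bitmap of 4K pages where offlined pages are permanently skipped by the allocator, the same effect the series achieves by reserving blocks in the DRM Buddy allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SZ 4096u
#define N_PAGES 64u

static bool offline[N_PAGES]; /* pages carved out as faulty */
static bool in_use[N_PAGES];  /* pages handed out to users */

/* Mark the 4K page covering @addr as permanently unusable. */
static void offline_addr(uint64_t addr)
{
	offline[addr / PAGE_SZ] = true;
}

/* First-fit single-page allocator that skips offlined pages. */
static int64_t alloc_page(void)
{
	for (unsigned int i = 0; i < N_PAGES; i++) {
		if (!offline[i] && !in_use[i]) {
			in_use[i] = true;
			return (int64_t)i * PAGE_SZ;
		}
	}
	return -1; /* no usable page left */
}
```

Once a page is offlined, no subsequent allocation ever returns it, mirroring the "dirty/used indefinitely" state described in patch 4.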
This series adds memory page offlining support with the following:
1. Add a helper to get the vram region from a physical address, to be used in the second patch
2. drm/xe/svm: Use xe_vram_addr_to_region, avoid block->private usage
3. Link and track ttm BO's with physical addresses
4. Handle a reported physical address error by reserving the address's 4K page
5. Add supporting debugfs to inject a manual physical address error
6. Add a buddy block allocation dump for debugging buddy related issues
7. Add a sysfs entry providing statistics of bad gpu vram pages for user info
Opens:
1. The dump_allocated_blocks() and xe_ttm_vram_addr_to_tbo() APIs will move under drm_buddy;
for now they are part of the xe code just to showcase the concept
V2:
- some fixes and clean up on errors
- Added xe_vram_addr_to_region helper to avoid other use of block->private (MattB)
Tejas Upadhyay (7):
drm/xe: Add a helper to get vram region from physical address
drm/xe/svm: Use xe_vram_addr_to_region
drm/xe: Implement VRAM object tracking ability using physical address
drm/xe: Handle physical memory address error
drm/xe/cri: Add debugfs to inject faulty vram address
drm/xe: Add routine to dump allocated VRAM blocks
[DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages
drivers/gpu/drm/xe/xe_debugfs.c | 49 +++
drivers/gpu/drm/xe/xe_device_sysfs.c | 2 +
drivers/gpu/drm/xe/xe_svm.c | 9 +-
drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 355 +++++++++++++++++++++
drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 6 +-
drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 23 ++
drivers/gpu/drm/xe/xe_vram.c | 16 +
drivers/gpu/drm/xe/xe_vram.h | 1 +
8 files changed, 453 insertions(+), 8 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 10+ messages in thread
* [RFC PATCH 1/7] drm/xe: Add a helper to get vram region from physical address
2026-02-12 16:34 [RFC PATCH 0/7] Add memory page offlining support Tejas Upadhyay
@ 2026-02-12 16:34 ` Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 2/7] drm/xe/svm: Use xe_vram_addr_to_region Tejas Upadhyay
` (5 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Tejas Upadhyay @ 2026-02-12 16:34 UTC (permalink / raw)
To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
Tejas Upadhyay
Add a helper function to retrieve the VRAM region that contains a given
physical address. This is needed by shared virtual memory (SVM) and by
advanced memory management features like memory offlining.
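The lookup the helper performs reduces to a containment test over disjoint per-tile ranges. A minimal userspace model (hypothetical `struct region` standing in for `struct xe_vram_region`, with `dpa_base`/`size` as its only fields):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct region {
	uint64_t dpa_base; /* device physical address base */
	uint64_t size;     /* region size in bytes */
};

/* Return the region whose [dpa_base, dpa_base + size) range holds
 * @addr, or NULL when the address falls outside every region. */
static const struct region *addr_to_region(const struct region *regions,
					   size_t n, uint64_t addr)
{
	for (size_t i = 0; i < n; i++) {
		const struct region *r = &regions[i];

		if (addr >= r->dpa_base && addr < r->dpa_base + r->size)
			return r;
	}
	return NULL;
}
```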
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_vram.c | 16 ++++++++++++++++
drivers/gpu/drm/xe/xe_vram.h | 1 +
2 files changed, 17 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_vram.c b/drivers/gpu/drm/xe/xe_vram.c
index 0538dcb8b18c..0768b4075bc9 100644
--- a/drivers/gpu/drm/xe/xe_vram.c
+++ b/drivers/gpu/drm/xe/xe_vram.c
@@ -365,3 +365,19 @@ resource_size_t xe_vram_region_actual_physical_size(const struct xe_vram_region
return vram ? vram->actual_physical_size : 0;
}
EXPORT_SYMBOL_IF_KUNIT(xe_vram_region_actual_physical_size);
+
+struct xe_vram_region *xe_vram_addr_to_region(struct xe_device *xe, resource_size_t addr)
+{
+ struct xe_vram_region *vr;
+ struct xe_tile *tile;
+ int id;
+
+ for_each_tile(tile, xe, id) {
+ vr = tile->mem.vram;
+ if (addr >= vr->dpa_base &&
+ addr + SZ_4K <= vr->dpa_base + vr->actual_physical_size)
+ return vr;
+ }
+ return NULL;
+}
+EXPORT_SYMBOL(xe_vram_addr_to_region);
diff --git a/drivers/gpu/drm/xe/xe_vram.h b/drivers/gpu/drm/xe/xe_vram.h
index 72860f714fc6..49ef89acba1c 100644
--- a/drivers/gpu/drm/xe/xe_vram.h
+++ b/drivers/gpu/drm/xe/xe_vram.h
@@ -20,5 +20,6 @@ resource_size_t xe_vram_region_io_size(const struct xe_vram_region *vram);
resource_size_t xe_vram_region_dpa_base(const struct xe_vram_region *vram);
resource_size_t xe_vram_region_usable_size(const struct xe_vram_region *vram);
resource_size_t xe_vram_region_actual_physical_size(const struct xe_vram_region *vram);
+struct xe_vram_region *xe_vram_addr_to_region(struct xe_device *xe, resource_size_t addr);
#endif
--
2.52.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH 2/7] drm/xe/svm: Use xe_vram_addr_to_region
2026-02-12 16:34 [RFC PATCH 0/7] Add memory page offlining support Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 1/7] drm/xe: Add a helper to get vram region from physical address Tejas Upadhyay
@ 2026-02-12 16:34 ` Tejas Upadhyay
2026-02-12 17:53 ` Matthew Auld
2026-02-12 16:34 ` [RFC PATCH 3/7] drm/xe: Implement VRAM object tracking ability using physical address Tejas Upadhyay
` (4 subsequent siblings)
6 siblings, 1 reply; 10+ messages in thread
From: Tejas Upadhyay @ 2026-02-12 16:34 UTC (permalink / raw)
To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
Tejas Upadhyay
Replace the direct use of block->private with the helper function
xe_vram_addr_to_region to look up the VRAM region.
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_svm.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 213f0334518a..e773456af040 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -762,7 +762,8 @@ static int xe_svm_populate_devmem_pfn(struct drm_pagemap_devmem *devmem_allocati
int j = 0;
list_for_each_entry(block, blocks, link) {
- struct xe_vram_region *vr = block->private;
+ u64 block_start = drm_buddy_block_offset(block);
+ struct xe_vram_region *vr = xe_vram_addr_to_region(bo->tile->xe, block_start);
struct drm_buddy *buddy = vram_to_buddy(vr);
u64 block_pfn = block_offset_to_pfn(devmem_allocation->dpagemap,
drm_buddy_block_offset(block));
@@ -1033,9 +1034,7 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
struct dma_fence *pre_migrate_fence = NULL;
struct xe_device *xe = vr->xe;
struct device *dev = xe->drm.dev;
- struct drm_buddy_block *block;
struct xe_validation_ctx vctx;
- struct list_head *blocks;
struct drm_exec exec;
struct xe_bo *bo;
int err = 0, idx;
@@ -1072,10 +1071,6 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
&dpagemap_devmem_ops, dpagemap, end - start,
pre_migrate_fence);
- blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
- list_for_each_entry(block, blocks, link)
- block->private = vr;
-
xe_bo_get(bo);
/* Ensure the device has a pm ref while there are device pages active. */
--
2.52.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH 3/7] drm/xe: Implement VRAM object tracking ability using physical address
2026-02-12 16:34 [RFC PATCH 0/7] Add memory page offlining support Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 1/7] drm/xe: Add a helper to get vram region from physical address Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 2/7] drm/xe/svm: Use xe_vram_addr_to_region Tejas Upadhyay
@ 2026-02-12 16:34 ` Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 4/7] drm/xe: Handle physical memory address error Tejas Upadhyay
` (3 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Tejas Upadhyay @ 2026-02-12 16:34 UTC (permalink / raw)
To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
Tejas Upadhyay
Implement the capability to track and identify TTM buffer objects
using a specific faulty memory address in VRAM. This functionality
is critical for supporting the memory page offline feature on CRI,
where identified faulty pages must be traced back to their
originating buffer for safe removal.
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 75 ++++++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 2 +-
2 files changed, 76 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index d6aa61e55f4d..4e852eed5170 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -56,6 +56,7 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
u64 size, min_page_size;
unsigned long lpfn;
int err;
+ struct drm_buddy_block *block;
lpfn = place->lpfn;
if (!lpfn || lpfn > man->size >> PAGE_SHIFT)
@@ -137,6 +138,8 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
}
mgr->visible_avail -= vres->used_visible_size;
+ list_for_each_entry(block, &vres->blocks, link)
+ block->private = tbo;
mutex_unlock(&mgr->lock);
if (!(vres->base.placement & TTM_PL_FLAG_CONTIGUOUS) &&
@@ -467,3 +470,75 @@ u64 xe_ttm_vram_get_avail(struct ttm_resource_manager *man)
return avail;
}
+
+static inline bool overlaps(u64 s1, u64 e1, u64 s2, u64 e2)
+{
+ return s1 <= e2 && e1 >= s2;
+}
+
+static inline bool contains(u64 s1, u64 e1, u64 s2, u64 e2)
+{
+ return s1 <= s2 && e1 >= e2;
+}
+
+static struct ttm_buffer_object *xe_ttm_vram_addr_to_tbo(struct drm_buddy *mm, u64 start)
+{
+ struct drm_buddy_block *block;
+ u64 end;
+ LIST_HEAD(dfs);
+ int i;
+
+ end = start + SZ_4K - 1;
+ for (i = 0; i < mm->n_roots; ++i)
+ list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+ do {
+ u64 block_start;
+ u64 block_end;
+
+ block = list_first_entry_or_null(&dfs,
+ struct drm_buddy_block,
+ tmp_link);
+ if (!block)
+ break;
+
+ list_del(&block->tmp_link);
+
+ block_start = drm_buddy_block_offset(block);
+ block_end = block_start + drm_buddy_block_size(mm, block) - 1;
+
+ if (!overlaps(start, end, block_start, block_end))
+ continue;
+
+ if (contains(start, end, block_start, block_end) &&
+ !drm_buddy_block_is_split(block)) {
+ if (drm_buddy_block_is_free(block)) {
+ return NULL;
+ } else if (drm_buddy_block_is_allocated(block) && !mm->clear_avail) {
+ struct ttm_buffer_object *tbo = block->private;
+
+ WARN_ON(!tbo);
+ return tbo;
+ }
+ }
+
+ if (drm_buddy_block_is_split(block)) {
+ list_add(&block->right->tmp_link, &dfs);
+ list_add(&block->left->tmp_link, &dfs);
+ }
+ } while (1);
+
+ return NULL;
+}
+
+int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
+{
+ struct xe_ttm_vram_mgr *vram_mgr = &tile->mem.vram->ttm;
+ struct drm_buddy *mm = &vram_mgr->mm;
+ struct ttm_buffer_object *tbo;
+
+ tbo = xe_ttm_vram_addr_to_tbo(mm, addr);
+
+ return 0;
+}
+EXPORT_SYMBOL(xe_ttm_tbo_handle_addr_fault);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
index 87b7fae5edba..1d6075411ebf 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
@@ -30,7 +30,7 @@ u64 xe_ttm_vram_get_avail(struct ttm_resource_manager *man);
u64 xe_ttm_vram_get_cpu_visible_size(struct ttm_resource_manager *man);
void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
u64 *used, u64 *used_visible);
-
+int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr);
static inline struct xe_ttm_vram_mgr_resource *
to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
{
--
2.52.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH 4/7] drm/xe: Handle physical memory address error
2026-02-12 16:34 [RFC PATCH 0/7] Add memory page offlining support Tejas Upadhyay
` (2 preceding siblings ...)
2026-02-12 16:34 ` [RFC PATCH 3/7] drm/xe: Implement VRAM object tracking ability using physical address Tejas Upadhyay
@ 2026-02-12 16:34 ` Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 5/7] drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
` (2 subsequent siblings)
6 siblings, 0 replies; 10+ messages in thread
From: Tejas Upadhyay @ 2026-02-12 16:34 UTC (permalink / raw)
To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
Tejas Upadhyay
This functionality represents a significant step in making
the xe driver gracefully handle hardware memory degradation.
By integrating with the DRM Buddy allocator, the driver
can permanently "carve out" faulty memory so it isn't reused
by subsequent allocations.
Buddy Block Reservation:
----------------------
When a memory address is reported as faulty, the driver instructs
the DRM Buddy allocator to reserve a block of the specific page
size (typically 4KB). This marks the memory as "dirty/used"
indefinitely.
Two-Stage Tracking:
-----------------
Offlined Pages:
Pages that have been successfully isolated and removed from the
available memory pool.
Queued Pages:
Addresses that have been flagged as faulty but are currently in
use by a process. These are tracked until the associated buffer
object (BO) is released or migrated, at which point they move
to the "offlined" state.
Sysfs Reporting:
--------------
The patch exposes these metrics through a standard interface,
allowing administrators to monitor VRAM health:
/sys/bus/pci/devices/<device_id>/vram_bad_pages
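The queued-to-offlined transition above can be sketched as a small state machine in userspace C. This is a toy model of the bookkeeping only, not the xe code; the names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

enum page_state { QUEUED, OFFLINED };

struct bad_page {
	uint64_t addr;
	enum page_state state;
};

struct stats {
	unsigned int n_queued, n_offlined;
};

/* Report a faulty address: if the page is still in use it is queued,
 * otherwise it is offlined immediately. */
static void report_bad_page(struct bad_page *p, struct stats *s, int busy)
{
	if (busy) {
		p->state = QUEUED;
		s->n_queued++;
	} else {
		p->state = OFFLINED;
		s->n_offlined++;
	}
}

/* Called once the owning BO is released: promote queued -> offlined. */
static void promote(struct bad_page *p, struct stats *s)
{
	if (p->state == QUEUED) {
		p->state = OFFLINED;
		s->n_queued--;
		s->n_offlined++;
	}
}
```

The counters correspond to the n_queued_pages/n_offlined_pages fields the patch adds to struct xe_ttm_vram_mgr.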
V2:
- Fix mm->avail counter issue
- Remove unused code and handle clean up in case of error
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 184 ++++++++++++++++++++-
drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 21 +++
2 files changed, 199 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 4e852eed5170..82d11348e5ce 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -276,6 +276,26 @@ static const struct ttm_resource_manager_func xe_ttm_vram_mgr_func = {
.debug = xe_ttm_vram_mgr_debug
};
+static void xe_ttm_vram_free_bad_pages(struct drm_device *dev, struct xe_ttm_vram_mgr *mgr)
+{
+ struct xe_ttm_offline_resource *pos, *n;
+
+ mutex_lock(&mgr->lock);
+ list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
+ --mgr->n_offlined_pages;
+ drm_buddy_free_list(&mgr->mm, &pos->blocks, 0);
+ mgr->visible_avail += pos->used_visible_size;
+ list_del(&pos->offlined_link);
+ kfree(pos);
+ }
+ list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
+ list_del(&pos->queued_link);
+ mgr->n_queued_pages--;
+ kfree(pos);
+ }
+ mutex_unlock(&mgr->lock);
+}
+
static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
{
struct xe_device *xe = to_xe_device(dev);
@@ -287,6 +307,8 @@ static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
if (ttm_resource_manager_evict_all(&xe->ttm, man))
return;
+ xe_ttm_vram_free_bad_pages(dev, mgr);
+
WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
drm_buddy_fini(&mgr->mm);
@@ -315,6 +337,8 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
man->func = &xe_ttm_vram_mgr_func;
mgr->mem_type = mem_type;
mutex_init(&mgr->lock);
+ INIT_LIST_HEAD(&mgr->offlined_pages);
+ INIT_LIST_HEAD(&mgr->queued_pages);
mgr->default_page_size = default_page_size;
mgr->visible_size = io_size;
mgr->visible_avail = io_size;
@@ -531,14 +555,162 @@ static struct ttm_buffer_object *xe_ttm_vram_addr_to_tbo(struct drm_buddy *mm, u
return NULL;
}
-int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
+static int xe_ttm_vram_reserve_page_at_addr(struct xe_device *xe, unsigned long addr,
+ struct xe_ttm_vram_mgr *vram_mgr, struct drm_buddy *mm)
{
- struct xe_ttm_vram_mgr *vram_mgr = &tile->mem.vram->ttm;
- struct drm_buddy mm = vram_mgr->mm;
- struct ttm_buffer_object *tbo;
+ int ret = 0;
+ u64 size = SZ_4K;
+ struct ttm_buffer_object *tbo = NULL;
+ struct xe_ttm_offline_resource *nentry;
- tbo = xe_ttm_vram_addr_to_tbo(&mm, addr);
+ mutex_lock(&vram_mgr->lock);
+ tbo = xe_ttm_vram_addr_to_tbo(mm, addr);
- return 0;
+ nentry = kzalloc(sizeof(*nentry), GFP_KERNEL);
+ if (!nentry) {
+ mutex_unlock(&vram_mgr->lock);
+ return -ENOMEM;
+ }
+ INIT_LIST_HEAD(&nentry->blocks);
+
+ if (tbo) {
+ struct xe_ttm_vram_mgr_resource *pvres;
+ struct ttm_placement place = {};
+ struct ttm_operation_ctx ctx = {
+ .interruptible = false,
+ .gfp_retry_mayfail = false,
+ };
+ bool locked;
+ struct xe_ttm_offline_resource *pos, *n;
+ struct xe_bo *pbo = ttm_to_xe_bo(tbo);
+
+ xe_bo_get(pbo);
+ /* Critical kernel BO? */
+ if (pbo->ttm.type == ttm_bo_type_kernel &&
+ !(pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM)) {
+ mutex_unlock(&vram_mgr->lock);
+ kfree(nentry);
+ xe_ttm_vram_free_bad_pages(&xe->drm, vram_mgr);
+ xe_bo_put(pbo);
+ drm_warn(&xe->drm,
+ "%s: corrupt addr: 0x%lx in critical kernel bo, wedge now\n",
+ __func__, addr);
+ /* Wedge the device */
+ xe_device_declare_wedged(xe);
+ return -EIO;
+ }
+ pvres = to_xe_ttm_vram_mgr_resource(pbo->ttm.resource);
+ nentry->id = ++vram_mgr->n_queued_pages;
+ nentry->blocks = pvres->blocks;
+ list_add(&nentry->queued_link, &vram_mgr->queued_pages);
+ mutex_unlock(&vram_mgr->lock);
+
+ /* Purge BO containing address */
+ spin_lock(&pbo->ttm.bdev->lru_lock);
+ locked = dma_resv_trylock(pbo->ttm.base.resv);
+ spin_unlock(&pbo->ttm.bdev->lru_lock);
+ WARN_ON(!locked);
+ ret = ttm_bo_validate(&pbo->ttm, &place, &ctx);
+ drm_WARN_ON(&xe->drm, ret);
+ if (locked)
+ dma_resv_unlock(pbo->ttm.base.resv);
+ xe_bo_put(pbo);
+
+ /* Reserve page at address addr*/
+ mutex_lock(&vram_mgr->lock);
+ ret = drm_buddy_alloc_blocks(mm, addr, addr + size,
+ size, size, &nentry->blocks,
+ DRM_BUDDY_RANGE_ALLOCATION);
+
+ if (ret) {
+ drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
+ addr, ret);
+ mutex_unlock(&vram_mgr->lock);
+ return ret;
+ }
+ if ((addr + size) <= vram_mgr->visible_size) {
+ nentry->used_visible_size = size;
+ } else {
+ struct drm_buddy_block *block;
+
+ list_for_each_entry(block, &nentry->blocks, link) {
+ u64 start = drm_buddy_block_offset(block);
+
+ if (start < vram_mgr->visible_size) {
+ u64 end = start + drm_buddy_block_size(mm, block);
+
+ nentry->used_visible_size +=
+ min(end, vram_mgr->visible_size) - start;
+ }
+ }
+ }
+ vram_mgr->visible_avail -= nentry->used_visible_size;
+ list_for_each_entry_safe(pos, n, &vram_mgr->queued_pages, queued_link) {
+ if (pos->id == nentry->id) {
+ --vram_mgr->n_queued_pages;
+ list_del(&pos->queued_link);
+ break;
+ }
+ }
+ list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
+ ++vram_mgr->n_offlined_pages;
+ mutex_unlock(&vram_mgr->lock);
+ return ret;
+
+ } else {
+ ret = drm_buddy_alloc_blocks(mm, addr, addr + size,
+ size, size, &nentry->blocks,
+ DRM_BUDDY_RANGE_ALLOCATION);
+ if (ret) {
+ drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
+ addr, ret);
+ kfree(nentry);
+ mutex_unlock(&vram_mgr->lock);
+ return ret;
+ }
+ if ((addr + size) <= vram_mgr->visible_size) {
+ nentry->used_visible_size = size;
+ } else {
+ struct drm_buddy_block *block;
+
+ list_for_each_entry(block, &nentry->blocks, link) {
+ u64 start = drm_buddy_block_offset(block);
+
+ if (start < vram_mgr->visible_size) {
+ u64 end = start + drm_buddy_block_size(mm, block);
+
+ nentry->used_visible_size +=
+ min(end, vram_mgr->visible_size) - start;
+ }
+ }
+ }
+ vram_mgr->visible_avail -= nentry->used_visible_size;
+ nentry->id = ++vram_mgr->n_offlined_pages;
+ list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
+ mutex_unlock(&vram_mgr->lock);
+ }
+ /* Success */
+ return ret;
+}
+
+/**
+ * xe_ttm_tbo_handle_addr_fault - Handle a flagged VRAM physical address error
+ * @tile: pointer to tile where address belongs
+ * @addr: physical faulty address
+ *
+ * Handle the faulty physical address on the given tile.
+ *
+ * Returns 0 for success, negative error code otherwise.
+ */
+int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
+{
+ struct xe_ttm_vram_mgr *vram_mgr = &tile->mem.vram->ttm;
+ struct xe_device *xe = tile_to_xe(tile);
+ struct drm_buddy *mm = &vram_mgr->mm;
+ int ret;
+
+ /* Reserve page at address */
+ ret = xe_ttm_vram_reserve_page_at_addr(xe, addr, vram_mgr, mm);
+ if (ret == -EIO)
+ return 0; /* success, wedged by kernel. */
+ return ret;
}
EXPORT_SYMBOL(xe_ttm_tbo_handle_addr_fault);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index a71e14818ec2..85511b51af75 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ -19,6 +19,14 @@ struct xe_ttm_vram_mgr {
struct ttm_resource_manager manager;
/** @mm: DRM buddy allocator which manages the VRAM */
struct drm_buddy mm;
+ /** @offlined_pages: List of offlined pages */
+ struct list_head offlined_pages;
+ /** @n_offlined_pages: Number of offlined pages */
+ u16 n_offlined_pages;
+ /** @queued_pages: List of queued pages */
+ struct list_head queued_pages;
+ /** @n_queued_pages: Number of queued pages */
+ u16 n_queued_pages;
/** @visible_size: Proped size of the CPU visible portion */
u64 visible_size;
/** @visible_avail: CPU visible portion still unallocated */
@@ -45,4 +53,17 @@ struct xe_ttm_vram_mgr_resource {
unsigned long flags;
};
+struct xe_ttm_offline_resource {
+ /** @offlined_link: Link to offlined pages */
+ struct list_head offlined_link;
+ /** @queued_link: Link to queued pages */
+ struct list_head queued_link;
+ /** @blocks: list of DRM buddy blocks */
+ struct list_head blocks;
+ /** @used_visible_size: How many CPU visible bytes this resource is using */
+ u64 used_visible_size;
+ /** @id: The id of an offline resource */
+ u16 id;
+};
+
#endif
--
2.52.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH 5/7] drm/xe/cri: Add debugfs to inject faulty vram address
2026-02-12 16:34 [RFC PATCH 0/7] Add memory page offlining support Tejas Upadhyay
` (3 preceding siblings ...)
2026-02-12 16:34 ` [RFC PATCH 4/7] drm/xe: Handle physical memory address error Tejas Upadhyay
@ 2026-02-12 16:34 ` Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 6/7] drm/xe: Add routine to dump allocated VRAM blocks Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 7/7] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
6 siblings, 0 replies; 10+ messages in thread
From: Tejas Upadhyay @ 2026-02-12 16:34 UTC (permalink / raw)
To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
Tejas Upadhyay
Add a debugfs interface which helps test the feature via manual error
injection. It allows faulty VRAM addresses to be injected by hand,
facilitating testing of the CRI memory page offline feature before the
hardware reporting path is in place. The implementation creates a
debugfs entry under
/sys/kernel/debug/dri/bdf/invalid_addr_vram0
to accept specific faulty addresses for validation.
For example,
echo 0x1000 > /sys/kernel/debug/dri/bdf/invalid_addr_vram0
where 0x1000 is the faulty address being injected.
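The write handler parses the echoed string with kstrtou64_from_user(..., 0, ...), which accepts 0x-prefixed hex or decimal. A userspace equivalent of that parsing, as a sketch:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a user-supplied address string, accepting 0x-prefixed hex or
 * decimal (base 0, as kstrtou64 does in the kernel). Returns 0 on
 * success, -EINVAL on empty input or trailing garbage. */
static int parse_addr(const char *buf, uint64_t *out)
{
	char *end;

	errno = 0;
	*out = strtoull(buf, &end, 0);
	if (errno || end == buf || (*end != '\0' && *end != '\n'))
		return -EINVAL;
	return 0;
}
```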
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_debugfs.c | 49 ++++++++++++++++++++++
drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 2 +
2 files changed, 51 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
index 844cfafe1ec7..d9dc1acebbce 100644
--- a/drivers/gpu/drm/xe/xe_debugfs.c
+++ b/drivers/gpu/drm/xe/xe_debugfs.c
@@ -27,6 +27,7 @@
#include "xe_sriov_vf.h"
#include "xe_step.h"
#include "xe_tile_debugfs.h"
+#include "xe_ttm_vram_mgr.h"
#include "xe_vsec.h"
#include "xe_wa.h"
@@ -509,12 +510,48 @@ static const struct file_operations disable_late_binding_fops = {
.write = disable_late_binding_set,
};
+static ssize_t addr_fault_reporting_show(struct file *f, char __user *ubuf,
+ size_t size, loff_t *pos)
+{
+ struct xe_device *xe = file_inode(f)->i_private;
+ char buf[32];
+ int len;
+
+ len = scnprintf(buf, sizeof(buf), "%#llx\n", xe->mem.vram->ttm.fault_addr);
+
+ return simple_read_from_buffer(ubuf, size, pos, buf, len);
+}
+
+static ssize_t addr_fault_reporting_set(struct file *f, const char __user *ubuf,
+ size_t size, loff_t *pos)
+{
+ struct xe_device *xe = file_inode(f)->i_private;
+ u64 addr;
+ int ret;
+
+ ret = kstrtou64_from_user(ubuf, size, 0, &addr);
+ if (ret)
+ return ret;
+
+ xe->mem.vram->ttm.fault_addr = addr;
+ xe_ttm_tbo_handle_addr_fault(xe_device_get_root_tile(xe), xe->mem.vram->ttm.fault_addr);
+
+ return size;
+}
+
+static const struct file_operations addr_fault_reporting_fops = {
+ .owner = THIS_MODULE,
+ .read = addr_fault_reporting_show,
+ .write = addr_fault_reporting_set,
+};
+
void xe_debugfs_register(struct xe_device *xe)
{
struct ttm_device *bdev = &xe->ttm;
struct drm_minor *minor = xe->drm.primary;
struct dentry *root = minor->debugfs_root;
struct ttm_resource_manager *man;
+ u8 mem_type = XE_PL_VRAM1;
struct xe_tile *tile;
struct xe_gt *gt;
u8 tile_id;
@@ -565,6 +602,18 @@ void xe_debugfs_register(struct xe_device *xe)
if (man)
ttm_resource_manager_create_debugfs(man, root, "stolen_mm");
+ do {
+ man = ttm_manager_type(bdev, mem_type);
+ if (man) {
+ char name[20];
+
+ snprintf(name, sizeof(name), "invalid_addr_vram%d", mem_type - XE_PL_VRAM0);
+ debugfs_create_file(name, 0600, root, xe,
+ &addr_fault_reporting_fops);
+ }
+ --mem_type;
+ } while (mem_type >= XE_PL_VRAM0);
+
for_each_tile(tile, xe, tile_id)
xe_tile_debugfs_register(tile);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index 85511b51af75..c93573b9aab2 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ -37,6 +37,8 @@ struct xe_ttm_vram_mgr {
struct mutex lock;
/** @mem_type: The TTM memory type */
u32 mem_type;
+ /** @fault_addr: debugfs hook for setting faulty address */
+ u64 fault_addr;
};
/**
--
2.52.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH 6/7] drm/xe: Add routine to dump allocated VRAM blocks
2026-02-12 16:34 [RFC PATCH 0/7] Add memory page offlining support Tejas Upadhyay
` (4 preceding siblings ...)
2026-02-12 16:34 ` [RFC PATCH 5/7] drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
@ 2026-02-12 16:34 ` Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 7/7] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
6 siblings, 0 replies; 10+ messages in thread
From: Tejas Upadhyay @ 2026-02-12 16:34 UTC (permalink / raw)
To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
Tejas Upadhyay
To allow seeing the allocated blocks under a specific VRAM instance in
the drm/xe driver, a new API is introduced. While existing dumps often
show only the free block list, this addition provides a comprehensive
view of all currently resident VRAM allocations.
Dump will look like,
[ +0.000003] xe 0000:03:00.0: [drm] 0x00000002f8000000-0x00000002f8800000: 8388608
[ +0.000005] xe 0000:03:00.0: [drm] 0x00000002f8800000-0x00000002f8840000: 262144
[ +0.000004] xe 0000:03:00.0: [drm] 0x00000002f8840000-0x00000002f8860000: 131072
[ +0.000004] xe 0000:03:00.0: [drm] 0x00000002f8860000-0x00000002f8870000: 65536
[ +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9000000-0x00000002f9800000: 8388608
[ +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9800000-0x00000002f9880000: 524288
[ +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9880000-0x00000002f9884000: 16384
[ +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9900000-0x00000002f9980000: 524288
[ +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9980000-0x00000002f9988000: 32768
[ +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9988000-0x00000002f998c000: 16384
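The dump walks the buddy tree iteratively with an explicit stack (the tmp_link list in the patch) and visits only allocated leaves. A userspace model of that traversal, counting instead of printing; the `struct blk` layout is a hypothetical stand-in:

```c
#include <assert.h>
#include <stdint.h>

/* Either a split node (children set) or a leaf that may be allocated. */
struct blk {
	uint64_t start, size;
	struct blk *left, *right;
	int allocated;
};

/* Count allocated leaves with an explicit stack, mirroring the
 * list-based DFS used instead of recursion in the kernel code. */
static unsigned int count_allocated(struct blk *root)
{
	struct blk *stack[64];
	int top = 0;
	unsigned int n = 0;

	stack[top++] = root;
	while (top) {
		struct blk *b = stack[--top];

		if (b->left) { /* split: descend into both halves */
			stack[top++] = b->right;
			stack[top++] = b->left;
		} else if (b->allocated) {
			n++; /* the real dump prints the range here */
		}
	}
	return n;
}
```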
V2: remove unused code
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 30 ++++++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 3 +++
2 files changed, 33 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 82d11348e5ce..cb3394000e83 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -505,6 +505,36 @@ static inline bool contains(u64 s1, u64 e1, u64 s2, u64 e2)
return s1 <= s2 && e1 <= e2;
}
+void xe_ttm_vram_dump_allocated_blocks(struct drm_device *dev, struct drm_buddy *mm,
+ struct drm_printer *p)
+{
+ struct drm_buddy_block *block;
+ LIST_HEAD(dfs);
+ int i;
+
+ for (i = 0; i < mm->n_roots; ++i)
+ list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+ do {
+ block = list_first_entry_or_null(&dfs,
+ struct drm_buddy_block,
+ tmp_link);
+ if (!block)
+ break;
+
+ list_del(&block->tmp_link);
+
+ if (drm_buddy_block_is_allocated(block))
+ drm_buddy_block_print(mm, block, p);
+
+ if (drm_buddy_block_is_split(block)) {
+ list_add(&block->right->tmp_link, &dfs);
+ list_add(&block->left->tmp_link, &dfs);
+ }
+ } while (1);
+}
+EXPORT_SYMBOL(xe_ttm_vram_dump_allocated_blocks);
+
static struct ttm_buffer_object *xe_ttm_vram_addr_to_tbo(struct drm_buddy *mm, u64 start)
{
struct drm_buddy_block *block;
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
index 1d6075411ebf..5872e8b48779 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
@@ -12,6 +12,7 @@ enum dma_data_direction;
struct xe_device;
struct xe_tile;
struct xe_vram_region;
+struct drm_device;
int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
u32 mem_type, u64 size, u64 io_size,
@@ -31,6 +32,8 @@ u64 xe_ttm_vram_get_cpu_visible_size(struct ttm_resource_manager *man);
void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
u64 *used, u64 *used_visible);
int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr);
+void xe_ttm_vram_dump_allocated_blocks(struct drm_device *dev, struct drm_buddy *mm,
+ struct drm_printer *p);
static inline struct xe_ttm_vram_mgr_resource *
to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
{
--
2.52.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH 7/7] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages
2026-02-12 16:34 [RFC PATCH 0/7] Add memory page offlining support Tejas Upadhyay
` (5 preceding siblings ...)
2026-02-12 16:34 ` [RFC PATCH 6/7] drm/xe: Add routine to dump allocated VRAM blocks Tejas Upadhyay
@ 2026-02-12 16:34 ` Tejas Upadhyay
6 siblings, 0 replies; 10+ messages in thread
From: Tejas Upadhyay @ 2026-02-12 16:34 UTC (permalink / raw)
To: intel-xe; +Cc: matthew.auld, matthew.brost, himal.prasad.ghimiray,
Tejas Upadhyay
Starting with CRI, include a sysfs interface designed to expose
information about bad VRAM pages: those identified as having hardware
faults (e.g., ECC errors). This interface allows userspace tools and
administrators to monitor the health of the GPU's local memory and
track the status of page retirement. Details on bad gpu vram pages
can be found under /sys/bus/pci/devices/bdf/vram_bad_pages.
The format is: pfn : gpu page size : flags
flags:
R: reserved, this gpu page is reserved.
P: pending reserve, this gpu page is marked as bad and will be reserved in the next page_reserve window.
F: unable to reserve, this gpu page can't be reserved for some reason.
For example, reading with cat /sys/bus/pci/devices/bdf/vram_bad_pages gives:
0x00000000 : 0x00001000 : R
0x00001234 : 0x00001000 : P
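The per-line layout shown above can be reproduced with a single format string, matching the "0x%08llx : 0x%08llx : %c" style the patch uses. A sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format one bad-page record as "pfn : size : flag", matching the
 * layout documented above. @flag is 'R', 'P' or 'F'. */
static int format_bad_page(char *buf, size_t len, uint64_t pfn,
			   uint64_t size, char flag)
{
	return snprintf(buf, len, "0x%08llx : 0x%08llx : %c\n",
			(unsigned long long)pfn,
			(unsigned long long)size, flag);
}
```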
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_device_sysfs.c | 2 +
drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 78 ++++++++++++++++++++++++++++
drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 1 +
3 files changed, 81 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_device_sysfs.c b/drivers/gpu/drm/xe/xe_device_sysfs.c
index a73e0e957cb0..e6a017601428 100644
--- a/drivers/gpu/drm/xe/xe_device_sysfs.c
+++ b/drivers/gpu/drm/xe/xe_device_sysfs.c
@@ -14,6 +14,7 @@
#include "xe_pcode_api.h"
#include "xe_pcode.h"
#include "xe_pm.h"
+#include "xe_ttm_vram_mgr.h"
/**
* DOC: Xe device sysfs
@@ -284,6 +285,7 @@ int xe_device_sysfs_init(struct xe_device *xe)
if (ret)
return ret;
}
+ ret = xe_ttm_vram_sysfs_init(xe);
+ if (ret)
+ return ret;
return 0;
}
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index cb3394000e83..c6a81ccaa9d2 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -744,3 +744,81 @@ int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr)
return ret;
}
EXPORT_SYMBOL(xe_ttm_tbo_handle_addr_fault);
+
+static ssize_t xe_ttm_vram_dump_bad_pages_info(char *buf, ssize_t s,
+ struct xe_ttm_vram_mgr *mgr)
+{
+ const unsigned int element_size = sizeof("0xabcdabcd : 0x12345678 : R\n") - 1;
+ struct xe_ttm_offline_resource *pos, *n;
+ struct drm_buddy_block *block;
+
+ mutex_lock(&mgr->lock);
+ list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
+ block = list_first_entry(&pos->blocks,
+ struct drm_buddy_block,
+ link);
+ s += scnprintf(&buf[s], element_size + 1,
+ "0x%08llx : 0x%08llx : %1s\n",
+ drm_buddy_block_offset(block) >> PAGE_SHIFT,
+ drm_buddy_block_size(&mgr->mm, block),
+ "R");
+ }
+ list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
+ block = list_first_entry(&pos->blocks,
+ struct drm_buddy_block,
+ link);
+ s += scnprintf(&buf[s], element_size + 1,
+ "0x%08llx : 0x%08llx : %1s\n",
+ drm_buddy_block_offset(block) >> PAGE_SHIFT,
+ drm_buddy_block_size(&mgr->mm, block),
+ "P");
+ }
+ mutex_unlock(&mgr->lock);
+
+ return s;
+}
+
+static ssize_t vram_bad_pages_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct xe_device *xe = pdev_to_xe_device(pdev);
+ struct ttm_resource_manager *man;
+ u8 mem_type = XE_PL_VRAM1;
+ ssize_t s = 0;
+
+ do {
+ man = ttm_manager_type(&xe->ttm, mem_type);
+ if (man) {
+ struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
+
+ s = xe_ttm_vram_dump_bad_pages_info(buf, s, mgr);
+ }
+ --mem_type;
+ } while (mem_type >= XE_PL_VRAM0);
+
+ return s;
+}
+static DEVICE_ATTR_RO(vram_bad_pages);
+
+static void xe_ttm_vram_sysfs_fini(void *arg)
+{
+ struct xe_device *xe = arg;
+
+ device_remove_file(xe->drm.dev, &dev_attr_vram_bad_pages);
+}
+
+/**
+ * xe_ttm_vram_sysfs_init - Initialize vram sysfs component
+ * @xe: Xe device object
+ *
+ * Needs to be called after the main device sysfs component is ready.
+ *
+ * Returns: 0 on success, negative error code on error.
+ */
+int xe_ttm_vram_sysfs_init(struct xe_device *xe)
+{
+ int err;
+
+ err = device_create_file(xe->drm.dev, &dev_attr_vram_bad_pages);
+ if (err)
+ return err;
+
+ return devm_add_action_or_reset(xe->drm.dev, xe_ttm_vram_sysfs_fini, xe);
+}
+EXPORT_SYMBOL(xe_ttm_vram_sysfs_init);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
index 5872e8b48779..6e69140c0be8 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
@@ -34,6 +34,7 @@ void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
int xe_ttm_tbo_handle_addr_fault(struct xe_tile *tile, unsigned long addr);
void xe_ttm_vram_dump_allocated_blocks(struct drm_device *dev, struct drm_buddy *mm,
struct drm_printer *p);
+int xe_ttm_vram_sysfs_init(struct xe_device *xe);
static inline struct xe_ttm_vram_mgr_resource *
to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
{
--
2.52.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [RFC PATCH 2/7] drm/xe/svm: Use xe_vram_addr_to_region
2026-02-12 16:34 ` [RFC PATCH 2/7] drm/xe/svm: Use xe_vram_addr_to_region Tejas Upadhyay
@ 2026-02-12 17:53 ` Matthew Auld
2026-02-13 6:00 ` Upadhyay, Tejas
0 siblings, 1 reply; 10+ messages in thread
From: Matthew Auld @ 2026-02-12 17:53 UTC (permalink / raw)
To: Tejas Upadhyay, intel-xe; +Cc: matthew.brost, himal.prasad.ghimiray
On 12/02/2026 16:34, Tejas Upadhyay wrote:
> Replace the direct use of block->private with the helper function
> xe_vram_addr_to_region to get vram region.
>
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> ---
> drivers/gpu/drm/xe/xe_svm.c | 9 ++-------
> 1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 213f0334518a..e773456af040 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -762,7 +762,8 @@ static int xe_svm_populate_devmem_pfn(struct drm_pagemap_devmem *devmem_allocati
> int j = 0;
>
> list_for_each_entry(block, blocks, link) {
> - struct xe_vram_region *vr = block->private;
> + u64 block_start = drm_buddy_block_offset(block);
> + struct xe_vram_region *vr = xe_vram_addr_to_region(bo->tile->xe, block_start);
Oh, I think block_offset is the relative offset, which always starts
from zero, but here you want the real addr, but that info comes from the
region...
I was gonna say since you seem to already have bo->tile above, why not
just pick the VRAM from that tile? Like with bo->tile->mem.vram? But I
think bo->tile is NULL even here so above looks like it will crash?
What about looking at the bo flags or current ttm placement to figure
out which VRAM this belongs to? There looks to already be
res_to_mem_region() and you have res above.
> struct drm_buddy *buddy = vram_to_buddy(vr);
> u64 block_pfn = block_offset_to_pfn(devmem_allocation->dpagemap,
> drm_buddy_block_offset(block));
> @@ -1033,9 +1034,7 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
> struct dma_fence *pre_migrate_fence = NULL;
> struct xe_device *xe = vr->xe;
> struct device *dev = xe->drm.dev;
> - struct drm_buddy_block *block;
> struct xe_validation_ctx vctx;
> - struct list_head *blocks;
> struct drm_exec exec;
> struct xe_bo *bo;
> int err = 0, idx;
> @@ -1072,10 +1071,6 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
> &dpagemap_devmem_ops, dpagemap, end - start,
> pre_migrate_fence);
>
> - blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
> - list_for_each_entry(block, blocks, link)
> - block->private = vr;
> -
> xe_bo_get(bo);
>
> /* Ensure the device has a pm ref while there are device pages active. */
^ permalink raw reply [flat|nested] 10+ messages in thread
* RE: [RFC PATCH 2/7] drm/xe/svm: Use xe_vram_addr_to_region
2026-02-12 17:53 ` Matthew Auld
@ 2026-02-13 6:00 ` Upadhyay, Tejas
0 siblings, 0 replies; 10+ messages in thread
From: Upadhyay, Tejas @ 2026-02-13 6:00 UTC (permalink / raw)
To: Auld, Matthew, intel-xe@lists.freedesktop.org
Cc: Brost, Matthew, Ghimiray, Himal Prasad
> -----Original Message-----
> From: Auld, Matthew <matthew.auld@intel.com>
> Sent: 12 February 2026 23:23
> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>; intel-
> xe@lists.freedesktop.org
> Cc: Brost, Matthew <matthew.brost@intel.com>; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [RFC PATCH 2/7] drm/xe/svm: Use xe_vram_addr_to_region
>
> On 12/02/2026 16:34, Tejas Upadhyay wrote:
> > Replace the direct use of block->private with the helper function
> > xe_vram_addr_to_region to get vram region.
> >
> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_svm.c | 9 ++-------
> > 1 file changed, 2 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> > index 213f0334518a..e773456af040 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -762,7 +762,8 @@ static int xe_svm_populate_devmem_pfn(struct drm_pagemap_devmem *devmem_allocati
> > int j = 0;
> >
> > list_for_each_entry(block, blocks, link) {
> > - struct xe_vram_region *vr = block->private;
> > + u64 block_start = drm_buddy_block_offset(block);
> > + struct xe_vram_region *vr = xe_vram_addr_to_region(bo->tile->xe, block_start);
>
> Oh, I think block_offset is the relative offset, which always starts from zero,
> but here you want the real addr, but that info comes from the region...
>
> I was gonna say since you seem to already have bo->tile above, why not just
> pick the VRAM from that tile? Like with bo->tile->mem.vram? But I think
> bo->tile is NULL even here so above looks like it will crash?
>
> What about looking at the bo flags or current ttm placement to figure out
> which VRAM this belongs to? There looks to already be
> res_to_mem_region() and you have res above.
Yes looks like res_to_mem_region will work.
Thanks,
Tejas
>
> > struct drm_buddy *buddy = vram_to_buddy(vr);
> > u64 block_pfn = block_offset_to_pfn(devmem_allocation->dpagemap,
> > drm_buddy_block_offset(block));
> > @@ -1033,9 +1034,7 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
> > struct dma_fence *pre_migrate_fence = NULL;
> > struct xe_device *xe = vr->xe;
> > struct device *dev = xe->drm.dev;
> > - struct drm_buddy_block *block;
> > struct xe_validation_ctx vctx;
> > - struct list_head *blocks;
> > struct drm_exec exec;
> > struct xe_bo *bo;
> > int err = 0, idx;
> > @@ -1072,10 +1071,6 @@ static int xe_drm_pagemap_populate_mm(struct drm_pagemap *dpagemap,
> > &dpagemap_devmem_ops, dpagemap, end - start,
> > pre_migrate_fence);
> >
> > - blocks = &to_xe_ttm_vram_mgr_resource(bo->ttm.resource)->blocks;
> > - list_for_each_entry(block, blocks, link)
> > - block->private = vr;
> > -
> > xe_bo_get(bo);
> >
> > /* Ensure the device has a pm ref while there are device pages
> > active. */
^ permalink raw reply [flat|nested] 10+ messages in thread
Thread overview: 10+ messages
2026-02-12 16:34 [RFC PATCH 0/7] Add memory page offlining support Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 1/7] drm/xe: Add a helper to get vram region from physical address Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 2/7] drm/xe/svm: Use xe_vram_addr_to_region Tejas Upadhyay
2026-02-12 17:53 ` Matthew Auld
2026-02-13 6:00 ` Upadhyay, Tejas
2026-02-12 16:34 ` [RFC PATCH 3/7] drm/xe: Implement VRAM object tracking ability using physical address Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 4/7] drm/xe: Handle physical memory address error Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 5/7] drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 6/7] drm/xe: Add routine to dump allocated VRAM blocks Tejas Upadhyay
2026-02-12 16:34 ` [RFC PATCH 7/7] [DO NOT REVIEW]drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay