* [RFC PATCH V6 0/7] Add memory page offlining support
@ 2026-03-27 11:48 Tejas Upadhyay
2026-03-27 11:48 ` [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy Tejas Upadhyay
` (10 more replies)
0 siblings, 11 replies; 24+ messages in thread
From: Tejas Upadhyay @ 2026-03-27 11:48 UTC (permalink / raw)
To: intel-xe
Cc: matthew.auld, matthew.brost, thomas.hellstrom,
himal.prasad.ghimiray, Tejas Upadhyay
This functionality represents a significant step in making
the xe driver gracefully handle hardware memory degradation.
By integrating with the DRM Buddy allocator, the driver
can permanently "carve out" faulty memory so it isn't reused
by subsequent allocations.
This series adds memory page offlining support with the following:
1. drm/xe/svm: Use xe_vram_addr_to_region, avoid block->private usage
2. Link and track TTM BOs with physical addresses
3. Handle the reported physical address error by reserving the address's 4K page
4. Add supporting debugfs to automate injection of physical address errors
5. Add buddy block allocation dump for debugging buddy related issues
6. Add configfs for vram bad page reservation policy
7. Sysfs entry to provide statistics of bad gpu vram pages for user info
V6:
- Add more specific tests to noncritical bo sections
- Handle smooth exit of user created exec queues
- Break code and make purge specific static API
V5:
- Sysfs "max_pages" addition
- Reset block->private NULL post purge
- Remove wedge; return -EIO so the system controller will initiate a reset
- Add debugfs tests to trigger different test scenarios manually and via igt
- Rename addr_to_tbo to addr_to_block and move under gpu/buddy.c
V4: API reworks, add configfs for policy reservation and apply config everywhere
V3: use res_to_mem_region to avoid use of block->private (MattA)
V2:
- some fixes and clean up on errors
- Added xe_vram_addr_to_region helper to avoid other use of block->private (MattB)
The debugfs entry allows testing different scenarios:
echo 0 > /sys/kernel/debug/dri/bdf/invalid_addr_vram0
where 0 is one of the address types below to be tested:
enum mempage_offline_mode {
MEMPAGE_OFFLINE_UNALLOCATED = 0,
MEMPAGE_OFFLINE_USER_ALLOCATED = 1,
MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED = 2,
MEMPAGE_OFFLINE_KERNEL_USER_PPGTT_ALLOCATED = 3,
MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED = 4,
MEMPAGE_OFFLINE_RESERVED = 5,
};
IGT tests for testing this feature:
https://patchwork.freedesktop.org/patch/714751/
Results of above tests:
Using IGT_SRANDOM=1774610050 for randomisation
Opened device: /dev/dri/card0
Starting subtest: unallocated
Subtest unallocated: SUCCESS (1.834s)
Starting subtest: user-allocated
Subtest user-allocated: SUCCESS (1.832s)
Starting subtest: user-ggtt-allocated
Subtest user-ggtt-allocated: SUCCESS (1.871s)
Starting subtest: user-ppgtt-allocated
Subtest user-ppgtt-allocated: SUCCESS (1.843s)
Starting subtest: critical-allocated
Subtest critical-allocated: SUCCESS (1.824s)
Starting subtest: reserved
Subtest reserved: SUCCESS (0.032s)
Tejas Upadhyay (7):
drm/xe: Link VRAM object with gpu buddy
drm/gpu: Add gpu_buddy_addr_to_block helper
drm/xe: Handle physical memory address error
drm/xe/cri: Add debugfs to inject faulty vram address
gpu/buddy: Add routine to dump allocated buddy blocks
drm/xe/configfs: Add vram bad page reservation policy
drm/xe/cri: Add sysfs interface for bad gpu vram pages
drivers/gpu/buddy.c | 99 +++++
drivers/gpu/drm/xe/xe_configfs.c | 64 ++-
drivers/gpu/drm/xe/xe_configfs.h | 2 +
drivers/gpu/drm/xe/xe_debugfs.c | 170 ++++++++
drivers/gpu/drm/xe/xe_device.c | 51 +++
drivers/gpu/drm/xe/xe_device_sysfs.c | 7 +
drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 427 +++++++++++++++++++++
drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 2 +
drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 30 ++
include/linux/gpu_buddy.h | 3 +
10 files changed, 854 insertions(+), 1 deletion(-)
--
2.52.0
^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
@ 2026-03-27 11:48 ` Tejas Upadhyay
2026-04-01 23:56 ` Matthew Brost
2026-03-27 11:48 ` [RFC PATCH V6 2/7] drm/gpu: Add gpu_buddy_addr_to_block helper Tejas Upadhyay
` (9 subsequent siblings)
10 siblings, 1 reply; 24+ messages in thread
From: Tejas Upadhyay @ 2026-03-27 11:48 UTC (permalink / raw)
To: intel-xe
Cc: matthew.auld, matthew.brost, thomas.hellstrom,
himal.prasad.ghimiray, Tejas Upadhyay
Set up linking of the TTM buffer object inside the gpu buddy block.
This functionality is critical for supporting the memory page offline
feature on CRI, where identified faulty pages must be traced back to
their originating buffer object for safe removal.
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 5fd0d5506a7e..c627dbf94552 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -54,6 +54,7 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
struct xe_ttm_vram_mgr_resource *vres;
struct gpu_buddy *mm = &mgr->mm;
+ struct gpu_buddy_block *block;
u64 size, min_page_size;
unsigned long lpfn;
int err;
@@ -138,6 +139,8 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
}
mgr->visible_avail -= vres->used_visible_size;
+ list_for_each_entry(block, &vres->blocks, link)
+ block->private = tbo;
mutex_unlock(&mgr->lock);
if (!(vres->base.placement & TTM_PL_FLAG_CONTIGUOUS) &&
--
2.52.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC PATCH V6 2/7] drm/gpu: Add gpu_buddy_addr_to_block helper
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
2026-03-27 11:48 ` [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy Tejas Upadhyay
@ 2026-03-27 11:48 ` Tejas Upadhyay
2026-04-02 0:09 ` Matthew Brost
2026-04-02 9:12 ` Matthew Auld
2026-03-27 11:48 ` [RFC PATCH V6 3/7] drm/xe: Handle physical memory address error Tejas Upadhyay
` (8 subsequent siblings)
10 siblings, 2 replies; 24+ messages in thread
From: Tejas Upadhyay @ 2026-03-27 11:48 UTC (permalink / raw)
To: intel-xe
Cc: matthew.auld, matthew.brost, thomas.hellstrom,
himal.prasad.ghimiray, Tejas Upadhyay
Add a helper whose primary purpose is to efficiently trace a specific
physical memory address back to its corresponding TTM buffer object.
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/buddy.c | 56 +++++++++++++++++++++++++++++++++++++++
include/linux/gpu_buddy.h | 2 ++
2 files changed, 58 insertions(+)
diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c
index 52686672e99f..2d26c2a0f971 100644
--- a/drivers/gpu/buddy.c
+++ b/drivers/gpu/buddy.c
@@ -589,6 +589,62 @@ void gpu_buddy_free_block(struct gpu_buddy *mm,
}
EXPORT_SYMBOL(gpu_buddy_free_block);
+/**
+ * gpu_buddy_addr_to_block - given physical address find a block
+ *
+ * @mm: GPU buddy manager
+ * @addr: Physical address
+ *
+ * Returns:
+ * gpu_buddy_block on success, NULL or error code on failure
+ */
+struct gpu_buddy_block *gpu_buddy_addr_to_block(struct gpu_buddy *mm, u64 addr)
+{
+ struct gpu_buddy_block *block;
+ LIST_HEAD(dfs);
+ u64 end;
+ int i;
+
+ end = addr + SZ_4K - 1;
+ for (i = 0; i < mm->n_roots; ++i)
+ list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+ do {
+ u64 block_start;
+ u64 block_end;
+
+ block = list_first_entry_or_null(&dfs,
+ struct gpu_buddy_block,
+ tmp_link);
+ if (!block)
+ break;
+
+ list_del(&block->tmp_link);
+
+ block_start = gpu_buddy_block_offset(block);
+ block_end = block_start + gpu_buddy_block_size(mm, block) - 1;
+
+ if (!overlaps(addr, end, block_start, block_end))
+ continue;
+
+ if (contains(addr, end, block_start, block_end) &&
+ !gpu_buddy_block_is_split(block)) {
+ if (gpu_buddy_block_is_free(block))
+ return NULL;
+ else if (gpu_buddy_block_is_allocated(block) && !mm->clear_avail)
+ return block;
+ }
+
+ if (gpu_buddy_block_is_split(block)) {
+ list_add(&block->right->tmp_link, &dfs);
+ list_add(&block->left->tmp_link, &dfs);
+ }
+ } while (1);
+
+ return ERR_PTR(-ENXIO);
+}
+EXPORT_SYMBOL(gpu_buddy_addr_to_block);
+
static void __gpu_buddy_free_list(struct gpu_buddy *mm,
struct list_head *objects,
bool mark_clear,
diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
index 5fa917ba5450..957c69c560bc 100644
--- a/include/linux/gpu_buddy.h
+++ b/include/linux/gpu_buddy.h
@@ -231,6 +231,8 @@ void gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear);
void gpu_buddy_free_block(struct gpu_buddy *mm, struct gpu_buddy_block *block);
+struct gpu_buddy_block *gpu_buddy_addr_to_block(struct gpu_buddy *mm, u64 addr);
+
void gpu_buddy_free_list(struct gpu_buddy *mm,
struct list_head *objects,
unsigned int flags);
--
2.52.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC PATCH V6 3/7] drm/xe: Handle physical memory address error
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
2026-03-27 11:48 ` [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy Tejas Upadhyay
2026-03-27 11:48 ` [RFC PATCH V6 2/7] drm/gpu: Add gpu_buddy_addr_to_block helper Tejas Upadhyay
@ 2026-03-27 11:48 ` Tejas Upadhyay
2026-04-01 23:53 ` Matthew Brost
2026-04-02 1:03 ` Matthew Brost
2026-03-27 11:48 ` [RFC PATCH V6 4/7] drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
` (7 subsequent siblings)
10 siblings, 2 replies; 24+ messages in thread
From: Tejas Upadhyay @ 2026-03-27 11:48 UTC (permalink / raw)
To: intel-xe
Cc: matthew.auld, matthew.brost, thomas.hellstrom,
himal.prasad.ghimiray, Tejas Upadhyay
This functionality represents a significant step in making
the xe driver gracefully handle hardware memory degradation.
By integrating with the DRM Buddy allocator, the driver
can permanently "carve out" faulty memory so it isn't reused
by subsequent allocations.
Buddy Block Reservation:
----------------------
When a memory address is reported as faulty, the driver instructs
the DRM Buddy allocator to reserve a block of the specific page
size (typically 4KB). This marks the memory as "dirty/used"
indefinitely.
Two-Stage Tracking:
-----------------
Offlined Pages:
Pages that have been successfully isolated and removed from the
available memory pool.
Queued Pages:
Addresses that have been flagged as faulty but are currently in
use by a process. These are tracked until the associated buffer
object (BO) is released or migrated, at which point they move
to the "offlined" state.
Sysfs Reporting:
--------------
The patch exposes these metrics through a standard interface,
allowing administrators to monitor VRAM health:
/sys/bus/pci/devices/<device_id>/vram_bad_pages
V5:
- Categorise and handle BOs accordingly
- Fix crash found with new debugfs tests
V4:
- Set block->private NULL post bo purge
- Filter out gsm address early on
- Rebase
V3:
- Rename API, remove tile dependency and add status of reservation
V2:
- Fix mm->avail counter issue
- Remove unused code and handle clean up in case of error
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 336 +++++++++++++++++++++
drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 1 +
drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 26 ++
3 files changed, 363 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index c627dbf94552..0fec7b332501 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -13,7 +13,10 @@
#include "xe_bo.h"
#include "xe_device.h"
+#include "xe_exec_queue.h"
+#include "xe_lrc.h"
#include "xe_res_cursor.h"
+#include "xe_ttm_stolen_mgr.h"
#include "xe_ttm_vram_mgr.h"
#include "xe_vram_types.h"
@@ -277,6 +280,26 @@ static const struct ttm_resource_manager_func xe_ttm_vram_mgr_func = {
.debug = xe_ttm_vram_mgr_debug
};
+static void xe_ttm_vram_free_bad_pages(struct drm_device *dev, struct xe_ttm_vram_mgr *mgr)
+{
+ struct xe_ttm_vram_offline_resource *pos, *n;
+
+ mutex_lock(&mgr->lock);
+ list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
+ --mgr->n_offlined_pages;
+ gpu_buddy_free_list(&mgr->mm, &pos->blocks, 0);
+ mgr->visible_avail += pos->used_visible_size;
+ list_del(&pos->offlined_link);
+ kfree(pos);
+ }
+ list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
+ list_del(&pos->queued_link);
+ mgr->n_queued_pages--;
+ kfree(pos);
+ }
+ mutex_unlock(&mgr->lock);
+}
+
static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
{
struct xe_device *xe = to_xe_device(dev);
@@ -288,6 +311,8 @@ static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
if (ttm_resource_manager_evict_all(&xe->ttm, man))
return;
+ xe_ttm_vram_free_bad_pages(dev, mgr);
+
WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
gpu_buddy_fini(&mgr->mm);
@@ -316,6 +341,8 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
man->func = &xe_ttm_vram_mgr_func;
mgr->mem_type = mem_type;
mutex_init(&mgr->lock);
+ INIT_LIST_HEAD(&mgr->offlined_pages);
+ INIT_LIST_HEAD(&mgr->queued_pages);
mgr->default_page_size = default_page_size;
mgr->visible_size = io_size;
mgr->visible_avail = io_size;
@@ -471,3 +498,312 @@ u64 xe_ttm_vram_get_avail(struct ttm_resource_manager *man)
return avail;
}
+
+static bool is_ttm_vram_migrate_lrc(struct xe_device *xe, struct xe_bo *pbo)
+{
+ if (pbo->ttm.type == ttm_bo_type_kernel &&
+ pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
+ (pbo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
+ !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
+ unsigned long idx;
+ struct xe_exec_queue *q;
+ struct drm_device *dev = &xe->drm;
+ struct drm_file *file;
+ struct xe_lrc *lrc;
+
+ /* TODO : Need to extend to multitile in future if needed */
+ mutex_lock(&dev->filelist_mutex);
+ list_for_each_entry(file, &dev->filelist, lhead) {
+ struct xe_file *xef = file->driver_priv;
+
+ mutex_lock(&xef->exec_queue.lock);
+ xa_for_each(&xef->exec_queue.xa, idx, q) {
+ xe_exec_queue_get(q);
+ mutex_unlock(&xef->exec_queue.lock);
+
+ for (int i = 0; i < q->width; i++) {
+ lrc = xe_exec_queue_get_lrc(q, i);
+ if (lrc->bo == pbo) {
+ xe_lrc_put(lrc);
+ mutex_lock(&xef->exec_queue.lock);
+ xe_exec_queue_put(q);
+ mutex_unlock(&xef->exec_queue.lock);
+ mutex_unlock(&dev->filelist_mutex);
+ return false;
+ }
+ xe_lrc_put(lrc);
+ }
+ mutex_lock(&xef->exec_queue.lock);
+ xe_exec_queue_put(q);
+ mutex_unlock(&xef->exec_queue.lock);
+ }
+ }
+ mutex_unlock(&dev->filelist_mutex);
+ return true;
+ }
+ return false;
+}
+
+static void xe_ttm_vram_purge_page(struct xe_device *xe, struct xe_bo *pbo)
+{
+ struct ttm_placement place = {};
+ struct ttm_operation_ctx ctx = {
+ .interruptible = false,
+ .gfp_retry_mayfail = false,
+ };
+ bool locked;
+ int ret = 0;
+
+ /* Ban VM if BO is PPGTT */
+ if (pbo->ttm.type == ttm_bo_type_kernel &&
+ pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
+ pbo->flags & XE_BO_FLAG_PAGETABLE) {
+ down_write(&pbo->vm->lock);
+ xe_vm_kill(pbo->vm, true);
+ up_write(&pbo->vm->lock);
+ }
+
+ /* Ban exec queue if BO is lrc */
+ if (pbo->ttm.type == ttm_bo_type_kernel &&
+ pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
+ (pbo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
+ !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
+ struct drm_device *dev = &xe->drm;
+ struct xe_exec_queue *q;
+ struct drm_file *file;
+ struct xe_lrc *lrc;
+ unsigned long idx;
+
+ /* TODO : Need to extend to multitile in future if needed */
+ mutex_lock(&dev->filelist_mutex);
+ list_for_each_entry(file, &dev->filelist, lhead) {
+ struct xe_file *xef = file->driver_priv;
+
+ mutex_lock(&xef->exec_queue.lock);
+ xa_for_each(&xef->exec_queue.xa, idx, q) {
+ xe_exec_queue_get(q);
+ mutex_unlock(&xef->exec_queue.lock);
+
+ for (int i = 0; i < q->width; i++) {
+ lrc = xe_exec_queue_get_lrc(q, i);
+ if (lrc->bo == pbo) {
+ xe_lrc_put(lrc);
+ xe_exec_queue_kill(q);
+ } else {
+ xe_lrc_put(lrc);
+ }
+ }
+
+ mutex_lock(&xef->exec_queue.lock);
+ xe_exec_queue_put(q);
+ mutex_unlock(&xef->exec_queue.lock);
+ }
+ }
+ mutex_unlock(&dev->filelist_mutex);
+ }
+
+ spin_lock(&pbo->ttm.bdev->lru_lock);
+ locked = dma_resv_trylock(pbo->ttm.base.resv);
+ spin_unlock(&pbo->ttm.bdev->lru_lock);
+ WARN_ON(!locked);
+ ret = ttm_bo_validate(&pbo->ttm, &place, &ctx);
+ drm_WARN_ON(&xe->drm, ret);
+ xe_bo_put(pbo);
+ if (locked)
+ dma_resv_unlock(pbo->ttm.base.resv);
+}
+
+static int xe_ttm_vram_reserve_page_at_addr(struct xe_device *xe, unsigned long addr,
+ struct xe_ttm_vram_mgr *vram_mgr, struct gpu_buddy *mm)
+{
+ struct xe_ttm_vram_offline_resource *nentry;
+ struct ttm_buffer_object *tbo = NULL;
+ struct gpu_buddy_block *block;
+ struct gpu_buddy_block *b, *m;
+ enum reserve_status {
+ pending = 0,
+ fail
+ };
+ u64 size = SZ_4K;
+ int ret = 0;
+
+ mutex_lock(&vram_mgr->lock);
+ block = gpu_buddy_addr_to_block(mm, addr);
+ if (PTR_ERR(block) == -ENXIO) {
+ mutex_unlock(&vram_mgr->lock);
+ return -ENXIO;
+ }
+
+ nentry = kzalloc_obj(*nentry);
+ if (!nentry)
+ return -ENOMEM;
+ INIT_LIST_HEAD(&nentry->blocks);
+ nentry->status = pending;
+
+ if (block) {
+ struct xe_ttm_vram_offline_resource *pos, *n;
+ struct xe_bo *pbo;
+
+ WARN_ON(!block->private);
+ tbo = block->private;
+ pbo = ttm_to_xe_bo(tbo);
+
+ xe_bo_get(pbo);
+ /* Critical kernel BO? */
+ if (pbo->ttm.type == ttm_bo_type_kernel &&
+ (!(pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM) ||
+ is_ttm_vram_migrate_lrc(xe, pbo))) {
+ mutex_unlock(&vram_mgr->lock);
+ kfree(nentry);
+ xe_ttm_vram_free_bad_pages(&xe->drm, vram_mgr);
+ xe_bo_put(pbo);
+ drm_err(&xe->drm,
+ "%s: corrupt addr: 0x%lx in critical kernel bo, request reset\n",
+ __func__, addr);
+ /* Hint System controller driver for reset with -EIO */
+ return -EIO;
+ }
+ nentry->id = ++vram_mgr->n_queued_pages;
+ list_add(&nentry->queued_link, &vram_mgr->queued_pages);
+ mutex_unlock(&vram_mgr->lock);
+
+ /* Purge BO containing address */
+ xe_ttm_vram_purge_page(xe, pbo);
+
+ /* Reserve page at address addr*/
+ mutex_lock(&vram_mgr->lock);
+ ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
+ size, size, &nentry->blocks,
+ GPU_BUDDY_RANGE_ALLOCATION);
+
+ if (ret) {
+ drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
+ addr, ret);
+ nentry->status = fail;
+ mutex_unlock(&vram_mgr->lock);
+ return ret;
+ }
+
+ list_for_each_entry_safe(b, m, &nentry->blocks, link)
+ b->private = NULL;
+
+ if ((addr + size) <= vram_mgr->visible_size) {
+ nentry->used_visible_size = size;
+ } else {
+ list_for_each_entry(b, &nentry->blocks, link) {
+ u64 start = gpu_buddy_block_offset(b);
+
+ if (start < vram_mgr->visible_size) {
+ u64 end = start + gpu_buddy_block_size(mm, b);
+
+ nentry->used_visible_size +=
+ min(end, vram_mgr->visible_size) - start;
+ }
+ }
+ }
+ vram_mgr->visible_avail -= nentry->used_visible_size;
+ list_for_each_entry_safe(pos, n, &vram_mgr->queued_pages, queued_link) {
+ if (pos->id == nentry->id) {
+ --vram_mgr->n_queued_pages;
+ list_del(&pos->queued_link);
+ break;
+ }
+ }
+ list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
+ /* TODO: FW Integration: Send command to FW for offlining page */
+ ++vram_mgr->n_offlined_pages;
+ mutex_unlock(&vram_mgr->lock);
+ return ret;
+
+ } else {
+ ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
+ size, size, &nentry->blocks,
+ GPU_BUDDY_RANGE_ALLOCATION);
+ if (ret) {
+ drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
+ addr, ret);
+ nentry->status = fail;
+ mutex_unlock(&vram_mgr->lock);
+ return ret;
+ }
+
+ list_for_each_entry_safe(b, m, &nentry->blocks, link)
+ b->private = NULL;
+
+ if ((addr + size) <= vram_mgr->visible_size) {
+ nentry->used_visible_size = size;
+ } else {
+ struct gpu_buddy_block *block;
+
+ list_for_each_entry(block, &nentry->blocks, link) {
+ u64 start = gpu_buddy_block_offset(block);
+
+ if (start < vram_mgr->visible_size) {
+ u64 end = start + gpu_buddy_block_size(mm, block);
+
+ nentry->used_visible_size +=
+ min(end, vram_mgr->visible_size) - start;
+ }
+ }
+ }
+ vram_mgr->visible_avail -= nentry->used_visible_size;
+ nentry->id = ++vram_mgr->n_offlined_pages;
+ list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
+ /* TODO: FW Integration: Send command to FW for offlining page */
+ mutex_unlock(&vram_mgr->lock);
+ }
+ /* Success */
+ return ret;
+}
+
+static struct xe_vram_region *xe_ttm_vram_addr_to_region(struct xe_device *xe,
+ resource_size_t addr)
+{
+ unsigned long stolen_base = xe_ttm_stolen_gpu_offset(xe);
+ struct xe_vram_region *vr;
+ struct xe_tile *tile;
+ int id;
+
+ /* Addr from stolen memory? */
+ if (addr + SZ_4K >= stolen_base)
+ return NULL;
+
+ for_each_tile(tile, xe, id) {
+ vr = tile->mem.vram;
+ if ((addr <= vr->dpa_base + vr->actual_physical_size) &&
+ (addr + SZ_4K >= vr->dpa_base))
+ return vr;
+ }
+ return NULL;
+}
+
+/**
+ * xe_ttm_vram_handle_addr_fault - Handle vram physical address error flagged
+ * @xe: pointer to parent device
+ * @addr: physical faulty address
+ *
+ * Handle the physical faulty address error on the specific tile.
+ *
+ * Returns 0 for success, negative error code otherwise.
+ */
+int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
+{
+ struct xe_ttm_vram_mgr *vram_mgr;
+ struct xe_vram_region *vr;
+ struct gpu_buddy *mm;
+ int ret;
+
+ vr = xe_ttm_vram_addr_to_region(xe, addr);
+ if (!vr) {
+ drm_err(&xe->drm, "%s:%d addr:%lx error requesting SBR\n",
+ __func__, __LINE__, addr);
+ /* Hint System controller driver for reset with -EIO */
+ return -EIO;
+ }
+ vram_mgr = &vr->ttm;
+ mm = &vram_mgr->mm;
+ /* Reserve page at address */
+ ret = xe_ttm_vram_reserve_page_at_addr(xe, addr, vram_mgr, mm);
+ return ret;
+}
+EXPORT_SYMBOL(xe_ttm_vram_handle_addr_fault);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
index 87b7fae5edba..8ef06d9d44f7 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
@@ -31,6 +31,7 @@ u64 xe_ttm_vram_get_cpu_visible_size(struct ttm_resource_manager *man);
void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
u64 *used, u64 *used_visible);
+int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr);
static inline struct xe_ttm_vram_mgr_resource *
to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
{
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index 9106da056b49..94eaf9d875f1 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ -19,6 +19,14 @@ struct xe_ttm_vram_mgr {
struct ttm_resource_manager manager;
/** @mm: DRM buddy allocator which manages the VRAM */
struct gpu_buddy mm;
+ /** @offlined_pages: List of offlined pages */
+ struct list_head offlined_pages;
+ /** @n_offlined_pages: Number of offlined pages */
+ u16 n_offlined_pages;
+ /** @queued_pages: List of queued pages */
+ struct list_head queued_pages;
+ /** @n_queued_pages: Number of queued pages */
+ u16 n_queued_pages;
/** @visible_size: Proped size of the CPU visible portion */
u64 visible_size;
/** @visible_avail: CPU visible portion still unallocated */
@@ -45,4 +53,22 @@ struct xe_ttm_vram_mgr_resource {
unsigned long flags;
};
+/**
+ * struct xe_ttm_vram_offline_resource - Xe TTM VRAM offline resource
+ */
+struct xe_ttm_vram_offline_resource {
+ /** @offlined_link: Link to offlined pages */
+ struct list_head offlined_link;
+ /** @queued_link: Link to queued pages */
+ struct list_head queued_link;
+ /** @blocks: list of DRM buddy blocks */
+ struct list_head blocks;
+ /** @used_visible_size: How many CPU visible bytes this resource is using */
+ u64 used_visible_size;
+ /** @id: The id of an offline resource */
+ u16 id;
+ /** @status: reservation status of resource */
+ bool status;
+};
+
#endif
--
2.52.0
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [RFC PATCH V6 4/7] drm/xe/cri: Add debugfs to inject faulty vram address
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
` (2 preceding siblings ...)
2026-03-27 11:48 ` [RFC PATCH V6 3/7] drm/xe: Handle physical memory address error Tejas Upadhyay
@ 2026-03-27 11:48 ` Tejas Upadhyay
2026-03-27 11:48 ` [RFC PATCH V6 5/7] gpu/buddy: Add routine to dump allocated buddy blocks Tejas Upadhyay
` (6 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Tejas Upadhyay @ 2026-03-27 11:48 UTC (permalink / raw)
To: intel-xe
Cc: matthew.auld, matthew.brost, thomas.hellstrom,
himal.prasad.ghimiray, Tejas Upadhyay
Add a debugfs interface which helps test the feature with manual error
injection. Adding a debugfs interface to the drm/xe driver allows manual
injection of faulty VRAM addresses, facilitating testing of the CRI
memory page offline feature. The implementation creates a debugfs entry
under
/sys/kernel/debug/dri/bdf/invalid_addr_vram0
to accept specific fault scenarios for validation.
For example,
echo 0 > /sys/kernel/debug/dri/bdf/invalid_addr_vram0
where 0 is one of the address types below to be tested:
enum mempage_offline_mode {
MEMPAGE_OFFLINE_UNALLOCATED = 0,
MEMPAGE_OFFLINE_USER_ALLOCATED = 1,
MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED = 2,
MEMPAGE_OFFLINE_KERNEL_USER_PPGTT_ALLOCATED = 3,
MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED = 4,
MEMPAGE_OFFLINE_RESERVED = 5
};
v3:
- Add more specific noncritical bo tests
v2:
- Add mode based automated test vs manual address feed
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_debugfs.c | 170 +++++++++++++++++++++
drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 2 +
2 files changed, 172 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
index 844cfafe1ec7..aaa779d8b2e2 100644
--- a/drivers/gpu/drm/xe/xe_debugfs.c
+++ b/drivers/gpu/drm/xe/xe_debugfs.c
@@ -27,6 +27,8 @@
#include "xe_sriov_vf.h"
#include "xe_step.h"
#include "xe_tile_debugfs.h"
+#include "xe_ttm_stolen_mgr.h"
+#include "xe_ttm_vram_mgr.h"
#include "xe_vsec.h"
#include "xe_wa.h"
@@ -38,6 +40,14 @@
DECLARE_FAULT_ATTR(gt_reset_failure);
DECLARE_FAULT_ATTR(inject_csc_hw_error);
+enum mempage_offline_mode {
+ MEMPAGE_OFFLINE_UNALLOCATED = 0,
+ MEMPAGE_OFFLINE_USER_ALLOCATED = 1,
+ MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED = 2,
+ MEMPAGE_OFFLINE_KERNEL_USER_PPGTT_ALLOCATED = 3,
+ MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED = 4,
+ MEMPAGE_OFFLINE_RESERVED = 5,
+};
static void read_residency_counter(struct xe_device *xe, struct xe_mmio *mmio,
u32 offset, const char *name, struct drm_printer *p)
@@ -509,6 +519,155 @@ static const struct file_operations disable_late_binding_fops = {
.write = disable_late_binding_set,
};
+static ssize_t addr_fault_reporting_show(struct file *f, char __user *ubuf,
+ size_t size, loff_t *pos)
+{
+ struct xe_device *xe = file_inode(f)->i_private;
+ char buf[32];
+ int len;
+
+ len = scnprintf(buf, sizeof(buf), "%lld\n", xe->mem.vram->ttm.offline_mode);
+
+ return simple_read_from_buffer(ubuf, size, pos, buf, len);
+}
+
+static int mempage_exec_offline(struct xe_device *xe, u64 mode)
+{
+ struct xe_vram_region *vr = xe_device_get_root_tile(xe)->mem.vram;
+ struct ttm_buffer_object *tbo = NULL;
+ struct xe_ttm_vram_mgr *vram_mgr;
+ struct gpu_buddy_block *block;
+ bool do_offline = false;
+ struct gpu_buddy *mm;
+ struct xe_bo *bo;
+ u64 addr = 0x0;
+ int ret = 0;
+
+ vram_mgr = &vr->ttm;
+ mm = &vram_mgr->mm;
+ addr = vr->dpa_base;
+ while (addr <= vr->dpa_base + vr->actual_physical_size) {
+ mutex_lock(&vram_mgr->lock);
+ block = gpu_buddy_addr_to_block(mm, addr);
+ if (!block && mode == MEMPAGE_OFFLINE_UNALLOCATED)
+ do_offline = true;
+ if (block && PTR_ERR(block) != -ENXIO) {
+ if (!block->private) {
+ mutex_unlock(&vram_mgr->lock);
+ addr = addr + SZ_4K;
+ do_offline = false;
+ continue;
+ }
+ tbo = block->private;
+ bo = ttm_to_xe_bo(tbo);
+ if (bo->ttm.type == ttm_bo_type_device &&
+ bo->flags & XE_BO_FLAG_USER &&
+ bo->flags & XE_BO_FLAG_VRAM_MASK &&
+ /* !lrc */
+ !(bo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
+ /* !ppgtt */
+ !(bo->flags & XE_BO_FLAG_PAGETABLE) &&
+ mode == MEMPAGE_OFFLINE_USER_ALLOCATED) {
+ do_offline = true;
+ } else if (bo->ttm.type == ttm_bo_type_kernel &&
+ bo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
+ (bo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
+ !(bo->flags & XE_BO_FLAG_PAGETABLE) &&
+ mode == MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED) {
+ /* lrc */
+ do_offline = true;
+ } else if (bo->ttm.type == ttm_bo_type_kernel &&
+ bo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
+ !(bo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
+ bo->flags & XE_BO_FLAG_PAGETABLE &&
+ mode == MEMPAGE_OFFLINE_KERNEL_USER_PPGTT_ALLOCATED) {
+ /* ppgtt */
+ do_offline = true;
+ } else if (bo->ttm.type == ttm_bo_type_kernel &&
+ !(bo->flags & XE_BO_FLAG_FORCE_USER_VRAM) &&
+ mode == MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED) {
+ do_offline = true;
+ }
+ }
+ if (do_offline) {
+ mutex_unlock(&vram_mgr->lock);
+ /* Report fault */
+ ret = xe_ttm_vram_handle_addr_fault(xe, addr);
+ if (ret) {
+ if ((ret == -EIO) &&
+ mode == MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED) {
+ addr = addr + SZ_4K;
+ if (do_offline)
+ do_offline = false;
+ continue;
+ }
+ break;
+ }
+ /* Verify addr + SZ_4K is allocated */
+ mutex_lock(&vram_mgr->lock);
+ block = gpu_buddy_addr_to_block(mm, addr);
+ if (!block || PTR_ERR(block) == -ENXIO || block->private)
+ ret = -EBUSY;
+ mutex_unlock(&vram_mgr->lock);
+ break;
+ }
+ mutex_unlock(&vram_mgr->lock);
+ addr = addr + SZ_4K;
+ if (do_offline)
+ do_offline = false;
+ }
+ return ret;
+}
+
+static ssize_t addr_fault_reporting_set(struct file *f, const char __user *ubuf,
+ size_t size, loff_t *pos)
+{
+ struct xe_device *xe = file_inode(f)->i_private;
+ int ret = 0;
+ u64 mode;
+
+ ret = kstrtou64_from_user(ubuf, size, 0, &mode);
+ if (ret)
+ return ret;
+
+ switch (mode) {
+ case MEMPAGE_OFFLINE_UNALLOCATED:
+ case MEMPAGE_OFFLINE_USER_ALLOCATED:
+ case MEMPAGE_OFFLINE_KERNEL_USER_GGTT_ALLOCATED:
+ case MEMPAGE_OFFLINE_KERNEL_USER_PPGTT_ALLOCATED:
+ case MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED:
+ ret = mempage_exec_offline(xe, mode);
+ break;
+ case MEMPAGE_OFFLINE_RESERVED:
+ u64 stolen_base;
+
+ stolen_base = xe_ttm_stolen_gpu_offset(xe);
+ ret = xe_ttm_vram_handle_addr_fault(xe, stolen_base);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ xe->mem.vram->ttm.offline_mode = mode;
+ if (!ret || (ret == -EIO &&
+ (mode == MEMPAGE_OFFLINE_KERNEL_CRITICAL_ALLOCATED ||
+ mode == MEMPAGE_OFFLINE_RESERVED))) {
+ drm_info(&xe->drm, "offline mode %llu passed\n", mode);
+ } else {
+ drm_warn(&xe->drm, "offline mode %llu failed, ret:%d\n", mode, ret);
+ return ret;
+ }
+
+ return size;
+}
+
+static const struct file_operations addr_fault_reporting_fops = {
+ .owner = THIS_MODULE,
+ .read = addr_fault_reporting_show,
+ .write = addr_fault_reporting_set,
+};
+
void xe_debugfs_register(struct xe_device *xe)
{
struct ttm_device *bdev = &xe->ttm;
@@ -565,6 +724,17 @@ void xe_debugfs_register(struct xe_device *xe)
if (man)
ttm_resource_manager_create_debugfs(man, root, "stolen_mm");
+ if (xe->info.platform == XE_CRESCENTISLAND) {
+ man = ttm_manager_type(bdev, XE_PL_VRAM0);
+ if (man) {
+ char name[20];
+
+ snprintf(name, sizeof(name), "invalid_addr_vram%d", 0);
+ debugfs_create_file(name, 0600, root, xe,
+ &addr_fault_reporting_fops);
+ }
+ }
+
for_each_tile(tile, xe, tile_id)
xe_tile_debugfs_register(tile);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index 94eaf9d875f1..65245668c183 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ -37,6 +37,8 @@ struct xe_ttm_vram_mgr {
struct mutex lock;
/** @mem_type: The TTM memory type */
u32 mem_type;
+ /** @offline_mode: debugfs hook for setting page offline mode */
+ u64 offline_mode;
};
/**
--
2.52.0
* [RFC PATCH V6 5/7] gpu/buddy: Add routine to dump allocated buddy blocks
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
` (3 preceding siblings ...)
2026-03-27 11:48 ` [RFC PATCH V6 4/7] drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
@ 2026-03-27 11:48 ` Tejas Upadhyay
2026-03-27 11:48 ` [RFC PATCH V6 6/7] drm/xe/configfs: Add vram bad page reservation policy Tejas Upadhyay
` (5 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Tejas Upadhyay @ 2026-03-27 11:48 UTC (permalink / raw)
To: intel-xe
Cc: matthew.auld, matthew.brost, thomas.hellstrom,
himal.prasad.ghimiray, Tejas Upadhyay
To allow inspecting the allocated blocks under a specific VRAM
instance in the drm driver, a new API is introduced. While the existing
dump helpers show the free block lists, this addition provides a
comprehensive view of all currently resident VRAM allocations.
Dump will look like,
[ +0.000003] xe 0000:03:00.0: [drm] 0x00000002f8000000-0x00000002f8800000: 8388608
[ +0.000005] xe 0000:03:00.0: [drm] 0x00000002f8800000-0x00000002f8840000: 262144
[ +0.000004] xe 0000:03:00.0: [drm] 0x00000002f8840000-0x00000002f8860000: 131072
[ +0.000004] xe 0000:03:00.0: [drm] 0x00000002f8860000-0x00000002f8870000: 65536
[ +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9000000-0x00000002f9800000: 8388608
[ +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9800000-0x00000002f9880000: 524288
[ +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9880000-0x00000002f9884000: 16384
[ +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9900000-0x00000002f9980000: 524288
[ +0.000005] xe 0000:03:00.0: [drm] 0x00000002f9980000-0x00000002f9988000: 32768
[ +0.000004] xe 0000:03:00.0: [drm] 0x00000002f9988000-0x00000002f998c000: 16384
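The traversal behind this dump is an iterative depth-first walk over the buddy trees, pushing children left-first so blocks are printed in address order. A minimal stand-alone sketch of that walk, using a hypothetical `struct node` in place of `struct gpu_buddy_block` and a fixed-size stack in place of the kernel's `tmp_link` list:

```c
#include <stdio.h>
#include <stddef.h>

struct node {
	unsigned long long start, size;
	int allocated;             /* leaf currently handed out */
	struct node *left, *right; /* both non-NULL when the block is split */
};

/* Push children right-then-left onto an explicit stack so the left
 * subtree is visited first, matching the list_add() order in the patch. */
static int dump_allocated(struct node **roots, int n_roots)
{
	struct node *stack[64];
	int top = 0, printed = 0;

	for (int i = n_roots - 1; i >= 0; i--)
		stack[top++] = roots[i];

	while (top) {
		struct node *b = stack[--top];

		if (b->allocated) {
			printf("0x%016llx-0x%016llx: %llu\n",
			       b->start, b->start + b->size, b->size);
			printed++;
		}
		if (b->left && b->right) {
			stack[top++] = b->right;
			stack[top++] = b->left;
		}
	}
	return printed;
}
```

This mirrors the in-kernel routine's structure only; the real implementation walks `mm->roots[]` via `tmp_link` and filters with `gpu_buddy_block_is_allocated()`.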
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/buddy.c | 43 +++++++++++++++++++++++++++++++++++++++
include/linux/gpu_buddy.h | 1 +
2 files changed, 44 insertions(+)
diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c
index 2d26c2a0f971..3e3f0dbbbda6 100644
--- a/drivers/gpu/buddy.c
+++ b/drivers/gpu/buddy.c
@@ -10,6 +10,7 @@
#include <linux/sizes.h>
#include <linux/gpu_buddy.h>
+#include <drm/drm_print.h>
/**
* gpu_buddy_assert - assert a condition in the buddy allocator
@@ -1289,6 +1290,48 @@ int gpu_buddy_block_trim(struct gpu_buddy *mm,
}
EXPORT_SYMBOL(gpu_buddy_block_trim);
+/**
+ * gpu_buddy_dump_allocated_blocks - print all allocated blocks in a GPU buddy
+ *
+ * @mm: GPU buddy manager to look into
+ *
+ * Walks every block in the buddy manager and prints the range and size of
+ * each block that is currently allocated.
+ */
+void gpu_buddy_dump_allocated_blocks(struct gpu_buddy *mm)
+{
+ struct gpu_buddy_block *block;
+ LIST_HEAD(dfs);
+ int i;
+
+ for (i = 0; i < mm->n_roots; ++i)
+ list_add_tail(&mm->roots[i]->tmp_link, &dfs);
+
+ do {
+ block = list_first_entry_or_null(&dfs,
+ struct gpu_buddy_block,
+ tmp_link);
+ if (!block)
+ break;
+
+ list_del(&block->tmp_link);
+
+ if (gpu_buddy_block_is_allocated(block))
+ gpu_buddy_block_print(mm, block);
+
+ if (gpu_buddy_block_is_split(block)) {
+ list_add(&block->right->tmp_link, &dfs);
+ list_add(&block->left->tmp_link, &dfs);
+ }
+ } while (1);
+}
+EXPORT_SYMBOL(gpu_buddy_dump_allocated_blocks);
+
static struct gpu_buddy_block *
__gpu_buddy_alloc_blocks(struct gpu_buddy *mm,
u64 start, u64 end,
diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
index 957c69c560bc..0a09603fa8b6 100644
--- a/include/linux/gpu_buddy.h
+++ b/include/linux/gpu_buddy.h
@@ -226,6 +226,7 @@ int gpu_buddy_block_trim(struct gpu_buddy *mm,
u64 *start,
u64 new_size,
struct list_head *blocks);
+void gpu_buddy_dump_allocated_blocks(struct gpu_buddy *mm);
void gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear);
--
2.52.0
* [RFC PATCH V6 6/7] drm/xe/configfs: Add vram bad page reservation policy
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
` (4 preceding siblings ...)
2026-03-27 11:48 ` [RFC PATCH V6 5/7] gpu/buddy: Add routine to dump allocated buddy blocks Tejas Upadhyay
@ 2026-03-27 11:48 ` Tejas Upadhyay
2026-03-27 11:48 ` [RFC PATCH V6 7/7] drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
` (4 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Tejas Upadhyay @ 2026-03-27 11:48 UTC (permalink / raw)
To: intel-xe
Cc: matthew.auld, matthew.brost, thomas.hellstrom,
himal.prasad.ghimiray, Tejas Upadhyay
The interface enables setting the policy for how bad pages in VRAM
are handled. This is crucial for maintaining system stability in
scenarios where VRAM degradation occurs.
The default policy is "reserve", which can be changed to
"logging only".
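The policy is a single boolean consulted at fault-handling time. A minimal sketch of that decision, where `handle_bad_addr()` and the `do_reserve` callback are hypothetical stand-ins for `xe_ttm_vram_handle_addr_fault()` and its reservation helper:

```c
#include <stdio.h>
#include <stdbool.h>

typedef int (*reserve_fn)(unsigned long addr);

static int handle_bad_addr(unsigned long addr, bool reserve_policy,
			   reserve_fn do_reserve)
{
	if (!reserve_policy) {
		/* "logging only": report the corrupted address, leave it in use */
		fprintf(stderr, "0x%lx reported as corrupted by HW\n", addr);
		return 0;
	}
	/* default "reserve": carve the page out of the buddy allocator */
	return do_reserve(addr);
}
```

In the driver the boolean comes from `xe_configfs_get_bad_page_reservation()`, read once per fault.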
v2:
- Add CRI check and rebase
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_configfs.c | 64 +++++++++++++++++++++++++++-
drivers/gpu/drm/xe/xe_configfs.h | 2 +
drivers/gpu/drm/xe/xe_device.c | 44 +++++++++++++++++++
drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 11 +++++
4 files changed, 120 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/xe/xe_configfs.c b/drivers/gpu/drm/xe/xe_configfs.c
index 32102600a148..4ac5274a4090 100644
--- a/drivers/gpu/drm/xe/xe_configfs.c
+++ b/drivers/gpu/drm/xe/xe_configfs.c
@@ -61,7 +61,8 @@
* ├── survivability_mode
* ├── gt_types_allowed
* ├── engines_allowed
- * └── enable_psmi
+ * ├── enable_psmi
+ * └── bad_page_reservation
*
* After configuring the attributes as per next section, the device can be
* probed with::
@@ -159,6 +160,16 @@
*
* This attribute can only be set before binding to the device.
*
+ * Bad pages reservation:
+ * ----------------------
+ *
+ * Disable VRAM bad page reservation so bad pages are only reported in
+ * dmesg. Example to disable it::
+ *
+ * # echo 0 > /sys/kernel/config/xe/0000:03:00.0/bad_page_reservation
+ *
+ * This attribute can only be set before binding to the device.
+ *
* Context restore BB
* ------------------
*
@@ -262,6 +273,7 @@ struct xe_config_group_device {
struct wa_bb ctx_restore_mid_bb[XE_ENGINE_CLASS_MAX];
bool survivability_mode;
bool enable_psmi;
+ bool bad_page_reservation;
struct {
unsigned int max_vfs;
bool admin_only_pf;
@@ -281,6 +293,7 @@ static const struct xe_config_device device_defaults = {
.engines_allowed = U64_MAX,
.survivability_mode = false,
.enable_psmi = false,
+ .bad_page_reservation = true,
.sriov = {
.max_vfs = XE_DEFAULT_MAX_VFS,
.admin_only_pf = XE_DEFAULT_ADMIN_ONLY_PF,
@@ -575,6 +588,32 @@ static ssize_t enable_psmi_store(struct config_item *item, const char *page, siz
return len;
}
+static ssize_t bad_page_reservation_show(struct config_item *item, char *page)
+{
+ struct xe_config_device *dev = to_xe_config_device(item);
+
+ return sprintf(page, "%d\n", dev->bad_page_reservation);
+}
+
+static ssize_t bad_page_reservation_store(struct config_item *item, const char *page, size_t len)
+{
+ struct xe_config_group_device *dev = to_xe_config_group_device(item);
+ bool val;
+ int ret;
+
+ ret = kstrtobool(page, &val);
+ if (ret)
+ return ret;
+
+ guard(mutex)(&dev->lock);
+ if (is_bound(dev))
+ return -EBUSY;
+
+ dev->config.bad_page_reservation = val;
+
+ return len;
+}
+
static bool wa_bb_read_advance(bool dereference, char **p,
const char *append, size_t len,
size_t *max_size)
@@ -813,6 +852,7 @@ static ssize_t ctx_restore_post_bb_store(struct config_item *item,
CONFIGFS_ATTR(, ctx_restore_mid_bb);
CONFIGFS_ATTR(, ctx_restore_post_bb);
CONFIGFS_ATTR(, enable_psmi);
+CONFIGFS_ATTR(, bad_page_reservation);
CONFIGFS_ATTR(, engines_allowed);
CONFIGFS_ATTR(, gt_types_allowed);
CONFIGFS_ATTR(, survivability_mode);
@@ -821,6 +861,7 @@ static struct configfs_attribute *xe_config_device_attrs[] = {
&attr_ctx_restore_mid_bb,
&attr_ctx_restore_post_bb,
&attr_enable_psmi,
+ &attr_bad_page_reservation,
&attr_engines_allowed,
&attr_gt_types_allowed,
&attr_survivability_mode,
@@ -1098,6 +1139,7 @@ static void dump_custom_dev_config(struct pci_dev *pdev,
PRI_CUSTOM_ATTR("%llx", gt_types_allowed);
PRI_CUSTOM_ATTR("%llx", engines_allowed);
PRI_CUSTOM_ATTR("%d", enable_psmi);
+ PRI_CUSTOM_ATTR("%d", bad_page_reservation);
PRI_CUSTOM_ATTR("%d", survivability_mode);
PRI_CUSTOM_ATTR("%u", sriov.admin_only_pf);
@@ -1225,6 +1267,26 @@ bool xe_configfs_get_psmi_enabled(struct pci_dev *pdev)
return ret;
}
+/**
+ * xe_configfs_get_bad_page_reservation - get configfs bad_page_reservation setting
+ * @pdev: pci device
+ *
+ * Return: bad_page_reservation setting in configfs
+ */
+bool xe_configfs_get_bad_page_reservation(struct pci_dev *pdev)
+{
+ struct xe_config_group_device *dev = find_xe_config_group_device(pdev);
+ bool ret;
+
+ if (!dev)
+ return device_defaults.bad_page_reservation;
+
+ ret = dev->config.bad_page_reservation;
+ config_group_put(&dev->group);
+
+ return ret;
+}
+
/**
* xe_configfs_get_ctx_restore_mid_bb - get configfs ctx_restore_mid_bb setting
* @pdev: pci device
diff --git a/drivers/gpu/drm/xe/xe_configfs.h b/drivers/gpu/drm/xe/xe_configfs.h
index 07d62bf0c152..c107d84b2c62 100644
--- a/drivers/gpu/drm/xe/xe_configfs.h
+++ b/drivers/gpu/drm/xe/xe_configfs.h
@@ -23,6 +23,7 @@ bool xe_configfs_primary_gt_allowed(struct pci_dev *pdev);
bool xe_configfs_media_gt_allowed(struct pci_dev *pdev);
u64 xe_configfs_get_engines_allowed(struct pci_dev *pdev);
bool xe_configfs_get_psmi_enabled(struct pci_dev *pdev);
+bool xe_configfs_get_bad_page_reservation(struct pci_dev *pdev);
u32 xe_configfs_get_ctx_restore_mid_bb(struct pci_dev *pdev,
enum xe_engine_class class,
const u32 **cs);
@@ -42,6 +43,7 @@ static inline bool xe_configfs_primary_gt_allowed(struct pci_dev *pdev) { return
static inline bool xe_configfs_media_gt_allowed(struct pci_dev *pdev) { return true; }
static inline u64 xe_configfs_get_engines_allowed(struct pci_dev *pdev) { return U64_MAX; }
static inline bool xe_configfs_get_psmi_enabled(struct pci_dev *pdev) { return false; }
+static inline bool xe_configfs_get_bad_page_reservation(struct pci_dev *pdev) { return true; }
static inline u32 xe_configfs_get_ctx_restore_mid_bb(struct pci_dev *pdev,
enum xe_engine_class class,
const u32 **cs) { return 0; }
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 44d04ac0951a..795256f9fdc7 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -25,6 +25,7 @@
#include "regs/xe_regs.h"
#include "xe_bo.h"
#include "xe_bo_evict.h"
+#include "xe_configfs.h"
#include "xe_debugfs.h"
#include "xe_defaults.h"
#include "xe_devcoredump.h"
@@ -68,6 +69,7 @@
#include "xe_tile.h"
#include "xe_ttm_stolen_mgr.h"
#include "xe_ttm_sys_mgr.h"
+#include "xe_ttm_vram_mgr.h"
#include "xe_vm.h"
#include "xe_vm_madvise.h"
#include "xe_vram.h"
@@ -833,6 +835,44 @@ static void detect_preproduction_hw(struct xe_device *xe)
}
}
+static int xe_device_process_bad_pages(struct xe_device *xe)
+{
+ unsigned long offlined[1] = {0x0};
+ unsigned long queued[1] = {0x3000};
+ int n_bad_pages = ARRAY_SIZE(offlined) + ARRAY_SIZE(queued);
+ unsigned long *bad_pages;
+ bool policy;
+ u8 i;
+
+ if (xe->info.platform != XE_CRESCENTISLAND)
+ return 0;
+
+ /* TODO: FW Integration: Query FW for offline/queued pages */
+
+ if (!n_bad_pages)
+ return 0;
+ bad_pages = kmalloc_array(n_bad_pages, sizeof(unsigned long), GFP_KERNEL);
+ if (!bad_pages)
+ return -ENOMEM;
+
+ for (int i = 0; i < ARRAY_SIZE(offlined); i++)
+ bad_pages[i] = offlined[i];
+ for (int i = 0; i < ARRAY_SIZE(queued); i++)
+ bad_pages[ARRAY_SIZE(offlined) + i] = queued[i];
+
+ /* Read policy from configfs */
+ policy = xe_configfs_get_bad_page_reservation(to_pci_dev(xe->drm.dev));
+ for (i = 0; i < n_bad_pages; i++) {
+ if (!policy)
+ drm_err(&xe->drm, "0x%lx is reported as corrupted address by HW\n",
+ bad_pages[i]);
+ else
+ xe_ttm_vram_handle_addr_fault(xe, bad_pages[i]);
+ }
+ kfree(bad_pages);
+ return 0;
+}
+
int xe_device_probe(struct xe_device *xe)
{
struct xe_tile *tile;
@@ -902,6 +942,10 @@ int xe_device_probe(struct xe_device *xe)
if (err)
return err;
+ err = xe_device_process_bad_pages(xe);
+ if (err)
+ return err;
+
/*
* Now that GT is initialized (TTM in particular),
* we can try to init display, and inherit the initial fb.
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index 0fec7b332501..cd64a6b49f14 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -12,6 +12,7 @@
#include <drm/ttm/ttm_range_manager.h>
#include "xe_bo.h"
+#include "xe_configfs.h"
#include "xe_device.h"
#include "xe_exec_queue.h"
#include "xe_lrc.h"
@@ -791,6 +792,7 @@ int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
struct xe_ttm_vram_mgr *vram_mgr;
struct xe_vram_region *vr;
struct gpu_buddy *mm;
+ bool policy;
int ret;
vr = xe_ttm_vram_addr_to_region(xe, addr);
@@ -802,6 +804,15 @@ int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
}
vram_mgr = &vr->ttm;
mm = &vram_mgr->mm;
+
+ policy = xe_configfs_get_bad_page_reservation(to_pci_dev(xe->drm.dev));
+ if (!policy) {
+ drm_err(&xe->drm, "0x%lx is reported as corrupted address by HW\n",
+ addr);
+ /* TODO: FW Integration: Report to FW to drop addr from SRAM queue */
+ return 0;
+ }
+
/* Reserve page at address */
ret = xe_ttm_vram_reserve_page_at_addr(xe, addr, vram_mgr, mm);
return ret;
--
2.52.0
* [RFC PATCH V6 7/7] drm/xe/cri: Add sysfs interface for bad gpu vram pages
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
` (5 preceding siblings ...)
2026-03-27 11:48 ` [RFC PATCH V6 6/7] drm/xe/configfs: Add vram bad page reservation policy Tejas Upadhyay
@ 2026-03-27 11:48 ` Tejas Upadhyay
2026-03-27 12:24 ` ✗ CI.checkpatch: warning for Add memory page offlining support (rev6) Patchwork
` (3 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Tejas Upadhyay @ 2026-03-27 11:48 UTC (permalink / raw)
To: intel-xe
Cc: matthew.auld, matthew.brost, thomas.hellstrom,
himal.prasad.ghimiray, Tejas Upadhyay
Starting with CRI, include a sysfs interface designed to expose
information about bad VRAM pages, i.e. those identified as having
hardware faults (e.g., ECC errors). This interface allows userspace
tools and administrators to monitor the health of the GPU's local
memory and track the status of page retirement. Details on bad GPU
VRAM pages can be found under /sys/bus/pci/devices/bdf/vram_bad_pages.
The format is: pfn : gpu page size : flags
flags:
R: reserved, this gpu page is reserved.
P: pending reserve, this gpu page is marked as bad and will be reserved in the next page_reserve window.
F: unable to reserve, this gpu page cannot be reserved for some reason.
For example if you read using cat /sys/bus/pci/devices/bdf/vram_bad_pages,
max_pages : 10000
0x00000000 : 0x00001000 : R
0x00001234 : 0x00001000 : P
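A userspace consumer can parse one line of this format with a single `sscanf()`. The field names below are assumptions based on the "pfn : gpu page size : flags" layout described in this commit message:

```c
#include <stdio.h>

struct bad_page {
	unsigned long long pfn;  /* GPU page frame number */
	unsigned long long size; /* GPU page size in bytes */
	char flag;               /* 'R', 'P' or 'F' */
};

/* Returns 0 on a well-formed "0x... : 0x... : X" line, -1 otherwise.
 * Note the header line "max_pages : N" will not match and is skipped. */
static int parse_bad_page(const char *line, struct bad_page *bp)
{
	return sscanf(line, "0x%llx : 0x%llx : %c",
		      &bp->pfn, &bp->size, &bp->flag) == 3 ? 0 : -1;
}
```

Tools would read the file line by line, skip the leading `max_pages` entry, and feed the rest through this parser.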
v2:
- Add max_pages info as per updated design doc
- Rebase
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
---
drivers/gpu/drm/xe/xe_device.c | 7 ++
drivers/gpu/drm/xe/xe_device_sysfs.c | 7 ++
drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 77 ++++++++++++++++++++++
drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 1 +
drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 2 +
5 files changed, 94 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 795256f9fdc7..27d14ca449a6 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -839,6 +839,8 @@ static int xe_device_process_bad_pages(struct xe_device *xe)
{
unsigned long offlined[1] = {0x0};
unsigned long queued[1] = {0x3000};
+ struct ttm_resource_manager *man;
+ struct xe_ttm_vram_mgr *mgr;
int n_bad_pages = ARRAY_SIZE(offlined) + ARRAY_SIZE(queued);
unsigned long *bad_pages;
bool policy;
@@ -848,6 +850,11 @@ static int xe_device_process_bad_pages(struct xe_device *xe)
return 0;
/* TODO: FW Integration: Query FW for offline/queued pages */
+ /* retrieve and fill max_pages from FW */
+	man = ttm_manager_type(&xe->ttm, XE_PL_VRAM0);
+	if (WARN_ON(!man))
+		return -ENODEV;
+	mgr = to_xe_ttm_vram_mgr(man);
+	mgr->max_pages = 10000;
if (!n_bad_pages)
return 0;
diff --git a/drivers/gpu/drm/xe/xe_device_sysfs.c b/drivers/gpu/drm/xe/xe_device_sysfs.c
index a73e0e957cb0..47c5be4180fe 100644
--- a/drivers/gpu/drm/xe/xe_device_sysfs.c
+++ b/drivers/gpu/drm/xe/xe_device_sysfs.c
@@ -8,12 +8,14 @@
#include <linux/pci.h>
#include <linux/sysfs.h>
+#include "xe_configfs.h"
#include "xe_device.h"
#include "xe_device_sysfs.h"
#include "xe_mmio.h"
#include "xe_pcode_api.h"
#include "xe_pcode.h"
#include "xe_pm.h"
+#include "xe_ttm_vram_mgr.h"
/**
* DOC: Xe device sysfs
@@ -267,6 +269,7 @@ static const struct attribute_group auto_link_downgrade_attr_group = {
int xe_device_sysfs_init(struct xe_device *xe)
{
struct device *dev = xe->drm.dev;
+ bool policy;
int ret;
if (xe->d3cold.capable) {
@@ -285,5 +288,9 @@ int xe_device_sysfs_init(struct xe_device *xe)
return ret;
}
+ policy = xe_configfs_get_bad_page_reservation(to_pci_dev(dev));
+ if (xe->info.platform == XE_CRESCENTISLAND && policy)
+ xe_ttm_vram_sysfs_init(xe);
+
return 0;
}
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
index cd64a6b49f14..a12e8d268f90 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
@@ -818,3 +818,80 @@ int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
return ret;
}
EXPORT_SYMBOL(xe_ttm_vram_handle_addr_fault);
+
+static void xe_ttm_vram_dump_bad_pages_info(char *buf, struct xe_ttm_vram_mgr *mgr)
+{
+ const unsigned int element_size = sizeof("0xabcdabcd : 0x12345678 : R\n") - 1;
+ const unsigned int maxpage_size = sizeof("max_pages: 10000\n") - 1;
+ struct xe_ttm_vram_offline_resource *pos, *n;
+ struct gpu_buddy_block *block;
+ ssize_t s = 0;
+
+ mutex_lock(&mgr->lock);
+ s += scnprintf(&buf[s], maxpage_size + 1, "max_pages: %d\n", mgr->max_pages);
+ list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
+ block = list_first_entry(&pos->blocks,
+ struct gpu_buddy_block,
+ link);
+ s += scnprintf(&buf[s], element_size + 1,
+ "0x%08llx : 0x%08llx : %1s\n",
+ gpu_buddy_block_offset(block) >> PAGE_SHIFT,
+ gpu_buddy_block_size(&mgr->mm, block),
+ "R");
+ }
+ list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
+ block = list_first_entry(&pos->blocks,
+ struct gpu_buddy_block,
+ link);
+ s += scnprintf(&buf[s], element_size + 1,
+ "0x%08llx : 0x%08llx : %1s\n",
+ gpu_buddy_block_offset(block) >> PAGE_SHIFT,
+ gpu_buddy_block_size(&mgr->mm, block),
+ pos->status ? "P" : "F");
+ }
+ mutex_unlock(&mgr->lock);
+}
+
+static ssize_t vram_bad_pages_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct xe_device *xe = pdev_to_xe_device(pdev);
+ struct ttm_resource_manager *man;
+ struct xe_ttm_vram_mgr *mgr;
+
+ man = ttm_manager_type(&xe->ttm, XE_PL_VRAM0);
+ if (man) {
+ mgr = to_xe_ttm_vram_mgr(man);
+ xe_ttm_vram_dump_bad_pages_info(buf, mgr);
+ }
+
+	return strlen(buf);
+}
+static DEVICE_ATTR_RO(vram_bad_pages);
+
+static void xe_ttm_vram_sysfs_fini(void *arg)
+{
+ struct xe_device *xe = arg;
+
+ device_remove_file(xe->drm.dev, &dev_attr_vram_bad_pages);
+}
+
+/**
+ * xe_ttm_vram_sysfs_init - Initialize vram sysfs component
+ * @xe: Xe device object
+ *
+ * It needs to be initialized after the main device sysfs is ready.
+ *
+ * Returns: 0 on success, negative error code on error.
+ */
+int xe_ttm_vram_sysfs_init(struct xe_device *xe)
+{
+ int err;
+
+	err = device_create_file(xe->drm.dev, &dev_attr_vram_bad_pages);
+	if (err)
+		return err;
+
+ return devm_add_action_or_reset(xe->drm.dev, xe_ttm_vram_sysfs_fini, xe);
+}
+EXPORT_SYMBOL(xe_ttm_vram_sysfs_init);
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
index 8ef06d9d44f7..c33e1a8d9217 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
@@ -32,6 +32,7 @@ void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
u64 *used, u64 *used_visible);
int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr);
+int xe_ttm_vram_sysfs_init(struct xe_device *xe);
static inline struct xe_ttm_vram_mgr_resource *
to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
{
diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
index 65245668c183..33fb38238943 100644
--- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
+++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
@@ -39,6 +39,8 @@ struct xe_ttm_vram_mgr {
u32 mem_type;
/** @offline_mode: debugfs hook for setting page offline mode */
u64 offline_mode;
+ /** @max_pages: max pages that can be in offline queue retrieved from FW */
+ u16 max_pages;
};
/**
--
2.52.0
* ✗ CI.checkpatch: warning for Add memory page offlining support (rev6)
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
` (6 preceding siblings ...)
2026-03-27 11:48 ` [RFC PATCH V6 7/7] drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
@ 2026-03-27 12:24 ` Patchwork
2026-03-27 12:26 ` ✓ CI.KUnit: success " Patchwork
` (2 subsequent siblings)
10 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2026-03-27 12:24 UTC (permalink / raw)
To: Tejas Upadhyay; +Cc: intel-xe
== Series Details ==
Series: Add memory page offlining support (rev6)
URL : https://patchwork.freedesktop.org/series/161473/
State : warning
== Summary ==
+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
1f57ba1afceae32108bd24770069f764d940a0e4
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit d77eadc6cbaa6fea9a63783d2ff4f3ae3eb62d35
Author: Tejas Upadhyay <tejas.upadhyay@intel.com>
Date: Fri Mar 27 17:18:20 2026 +0530
drm/xe/cri: Add sysfs interface for bad gpu vram pages
Starting CRI, Include a sysfs interface designed to expose information
about bad VRAM pages—those identified as having hardware faults
(e.g., ECC errors). This interface allows userspace tools and
administrators to monitor the health of the GPU's local memory and
track the status of page retirement.To get details on bad gpu vram
pages can be found under /sys/bus/pci/devices/bdf/vram_bad_pages.
Where The format is, pfn : gpu page size : flags
flags:
R: reserved, this gpu page is reserved.
P: pending for reserve, this gpu page is marked as bad, will be reserved in next window of page_reserve.
F: unable to reserve. this gpu page can’t be reserved due to some reasons.
For example if you read using cat /sys/bus/pci/devices/bdf/vram_bad_pages,
max_pages : 10000
0x00000000 : 0x00001000 : R
0x00001234 : 0x00001000 : P
v2:
- Add max_pages info as per updated design doc
- Rebase
Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
+ /mt/dim checkpatch b0c967b64ac6322238cc1cbabd75a3be31fbb637 drm-intel
509bd1ae8f79 drm/xe: Link VRAM object with gpu buddy
5727c34a432a drm/gpu: Add gpu_buddy_addr_to_block helper
112b93f6040d drm/xe: Handle physical memory address error
-:13: ERROR:BAD_COMMIT_SEPARATOR: Invalid commit separator - some tools may have problems applying this
#13:
----------------------
-:20: ERROR:BAD_COMMIT_SEPARATOR: Invalid commit separator - some tools may have problems applying this
#20:
-----------------
-:32: ERROR:BAD_COMMIT_SEPARATOR: Invalid commit separator - some tools may have problems applying this
#32:
--------------
total: 3 errors, 0 warnings, 0 checks, 407 lines checked
c824fd0fae8b drm/xe/cri: Add debugfs to inject faulty vram address
1e94c7b3d31b gpu/buddy: Add routine to dump allocated buddy blocks
-:13: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#13:
[ +0.000003] xe 0000:03:00.0: [drm] 0x00000002f8000000-0x00000002f8800000: 8388608
total: 0 errors, 1 warnings, 0 checks, 62 lines checked
35361583d033 drm/xe/configfs: Add vram bad page reservation policy
d77eadc6cbaa drm/xe/cri: Add sysfs interface for bad gpu vram pages
-:20: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#20:
P: pending for reserve, this gpu page is marked as bad, will be reserved in next window of page_reserve.
total: 0 errors, 1 warnings, 0 checks, 144 lines checked
* ✓ CI.KUnit: success for Add memory page offlining support (rev6)
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
` (7 preceding siblings ...)
2026-03-27 12:24 ` ✗ CI.checkpatch: warning for Add memory page offlining support (rev6) Patchwork
@ 2026-03-27 12:26 ` Patchwork
2026-03-27 13:16 ` ✓ Xe.CI.BAT: " Patchwork
2026-03-28 4:49 ` ✓ Xe.CI.FULL: " Patchwork
10 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2026-03-27 12:26 UTC (permalink / raw)
To: Tejas Upadhyay; +Cc: intel-xe
== Series Details ==
Series: Add memory page offlining support (rev6)
URL : https://patchwork.freedesktop.org/series/161473/
State : success
== Summary ==
+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
[12:24:51] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[12:24:55] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[12:25:39] Starting KUnit Kernel (1/1)...
[12:25:39] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[12:25:39] ================== guc_buf (11 subtests) ===================
[12:25:39] [PASSED] test_smallest
[12:25:39] [PASSED] test_largest
[12:25:39] [PASSED] test_granular
[12:25:39] [PASSED] test_unique
[12:25:39] [PASSED] test_overlap
[12:25:39] [PASSED] test_reusable
[12:25:39] [PASSED] test_too_big
[12:25:39] [PASSED] test_flush
[12:25:39] [PASSED] test_lookup
[12:25:39] [PASSED] test_data
[12:25:39] [PASSED] test_class
[12:25:39] ===================== [PASSED] guc_buf =====================
[12:25:39] =================== guc_dbm (7 subtests) ===================
[12:25:39] [PASSED] test_empty
[12:25:39] [PASSED] test_default
[12:25:39] ======================== test_size ========================
[12:25:39] [PASSED] 4
[12:25:39] [PASSED] 8
[12:25:39] [PASSED] 32
[12:25:39] [PASSED] 256
[12:25:39] ==================== [PASSED] test_size ====================
[12:25:39] ======================= test_reuse ========================
[12:25:39] [PASSED] 4
[12:25:39] [PASSED] 8
[12:25:39] [PASSED] 32
[12:25:39] [PASSED] 256
[12:25:39] =================== [PASSED] test_reuse ====================
[12:25:39] =================== test_range_overlap ====================
[12:25:39] [PASSED] 4
[12:25:39] [PASSED] 8
[12:25:39] [PASSED] 32
[12:25:39] [PASSED] 256
[12:25:39] =============== [PASSED] test_range_overlap ================
[12:25:39] =================== test_range_compact ====================
[12:25:39] [PASSED] 4
[12:25:39] [PASSED] 8
[12:25:39] [PASSED] 32
[12:25:39] [PASSED] 256
[12:25:39] =============== [PASSED] test_range_compact ================
[12:25:39] ==================== test_range_spare =====================
[12:25:39] [PASSED] 4
[12:25:39] [PASSED] 8
[12:25:39] [PASSED] 32
[12:25:39] [PASSED] 256
[12:25:39] ================ [PASSED] test_range_spare =================
[12:25:39] ===================== [PASSED] guc_dbm =====================
[12:25:39] =================== guc_idm (6 subtests) ===================
[12:25:39] [PASSED] bad_init
[12:25:39] [PASSED] no_init
[12:25:39] [PASSED] init_fini
[12:25:39] [PASSED] check_used
[12:25:39] [PASSED] check_quota
[12:25:39] [PASSED] check_all
[12:25:39] ===================== [PASSED] guc_idm =====================
[12:25:39] ================== no_relay (3 subtests) ===================
[12:25:39] [PASSED] xe_drops_guc2pf_if_not_ready
[12:25:39] [PASSED] xe_drops_guc2vf_if_not_ready
[12:25:39] [PASSED] xe_rejects_send_if_not_ready
[12:25:39] ==================== [PASSED] no_relay =====================
[12:25:39] ================== pf_relay (14 subtests) ==================
[12:25:39] [PASSED] pf_rejects_guc2pf_too_short
[12:25:39] [PASSED] pf_rejects_guc2pf_too_long
[12:25:39] [PASSED] pf_rejects_guc2pf_no_payload
[12:25:39] [PASSED] pf_fails_no_payload
[12:25:39] [PASSED] pf_fails_bad_origin
[12:25:39] [PASSED] pf_fails_bad_type
[12:25:39] [PASSED] pf_txn_reports_error
[12:25:39] [PASSED] pf_txn_sends_pf2guc
[12:25:39] [PASSED] pf_sends_pf2guc
[12:25:39] [SKIPPED] pf_loopback_nop
[12:25:39] [SKIPPED] pf_loopback_echo
[12:25:39] [SKIPPED] pf_loopback_fail
[12:25:39] [SKIPPED] pf_loopback_busy
[12:25:39] [SKIPPED] pf_loopback_retry
[12:25:39] ==================== [PASSED] pf_relay =====================
[12:25:39] ================== vf_relay (3 subtests) ===================
[12:25:39] [PASSED] vf_rejects_guc2vf_too_short
[12:25:39] [PASSED] vf_rejects_guc2vf_too_long
[12:25:39] [PASSED] vf_rejects_guc2vf_no_payload
[12:25:39] ==================== [PASSED] vf_relay =====================
[12:25:39] ================ pf_gt_config (9 subtests) =================
[12:25:39] [PASSED] fair_contexts_1vf
[12:25:39] [PASSED] fair_doorbells_1vf
[12:25:39] [PASSED] fair_ggtt_1vf
[12:25:39] ====================== fair_vram_1vf ======================
[12:25:39] [PASSED] 3.50 GiB
[12:25:39] [PASSED] 11.5 GiB
[12:25:39] [PASSED] 15.5 GiB
[12:25:39] [PASSED] 31.5 GiB
[12:25:39] [PASSED] 63.5 GiB
[12:25:39] [PASSED] 1.91 GiB
[12:25:39] ================== [PASSED] fair_vram_1vf ==================
[12:25:39] ================ fair_vram_1vf_admin_only =================
[12:25:39] [PASSED] 3.50 GiB
[12:25:39] [PASSED] 11.5 GiB
[12:25:39] [PASSED] 15.5 GiB
[12:25:39] [PASSED] 31.5 GiB
[12:25:39] [PASSED] 63.5 GiB
[12:25:39] [PASSED] 1.91 GiB
[12:25:39] ============ [PASSED] fair_vram_1vf_admin_only =============
[12:25:39] ====================== fair_contexts ======================
[12:25:39] [PASSED] 1 VF
[12:25:39] [PASSED] 2 VFs
[12:25:39] [PASSED] 3 VFs
[12:25:39] [PASSED] 4 VFs
[12:25:39] [PASSED] 5 VFs
[12:25:39] [PASSED] 6 VFs
[12:25:39] [PASSED] 7 VFs
[12:25:39] [PASSED] 8 VFs
[12:25:39] [PASSED] 9 VFs
[12:25:39] [PASSED] 10 VFs
[12:25:39] [PASSED] 11 VFs
[12:25:39] [PASSED] 12 VFs
[12:25:39] [PASSED] 13 VFs
[12:25:39] [PASSED] 14 VFs
[12:25:39] [PASSED] 15 VFs
[12:25:39] [PASSED] 16 VFs
[12:25:39] [PASSED] 17 VFs
[12:25:39] [PASSED] 18 VFs
[12:25:39] [PASSED] 19 VFs
[12:25:39] [PASSED] 20 VFs
[12:25:39] [PASSED] 21 VFs
[12:25:39] [PASSED] 22 VFs
[12:25:39] [PASSED] 23 VFs
[12:25:39] [PASSED] 24 VFs
[12:25:39] [PASSED] 25 VFs
[12:25:39] [PASSED] 26 VFs
[12:25:39] [PASSED] 27 VFs
[12:25:39] [PASSED] 28 VFs
[12:25:39] [PASSED] 29 VFs
[12:25:39] [PASSED] 30 VFs
[12:25:39] [PASSED] 31 VFs
[12:25:39] [PASSED] 32 VFs
[12:25:39] [PASSED] 33 VFs
[12:25:39] [PASSED] 34 VFs
[12:25:39] [PASSED] 35 VFs
[12:25:39] [PASSED] 36 VFs
[12:25:39] [PASSED] 37 VFs
[12:25:39] [PASSED] 38 VFs
[12:25:39] [PASSED] 39 VFs
[12:25:39] [PASSED] 40 VFs
[12:25:39] [PASSED] 41 VFs
[12:25:39] [PASSED] 42 VFs
[12:25:39] [PASSED] 43 VFs
[12:25:39] [PASSED] 44 VFs
[12:25:39] [PASSED] 45 VFs
[12:25:39] [PASSED] 46 VFs
[12:25:39] [PASSED] 47 VFs
[12:25:39] [PASSED] 48 VFs
[12:25:39] [PASSED] 49 VFs
[12:25:39] [PASSED] 50 VFs
[12:25:39] [PASSED] 51 VFs
[12:25:39] [PASSED] 52 VFs
[12:25:39] [PASSED] 53 VFs
[12:25:39] [PASSED] 54 VFs
[12:25:39] [PASSED] 55 VFs
[12:25:39] [PASSED] 56 VFs
[12:25:39] [PASSED] 57 VFs
[12:25:39] [PASSED] 58 VFs
[12:25:39] [PASSED] 59 VFs
[12:25:39] [PASSED] 60 VFs
[12:25:39] [PASSED] 61 VFs
[12:25:39] [PASSED] 62 VFs
[12:25:39] [PASSED] 63 VFs
[12:25:39] ================== [PASSED] fair_contexts ==================
[12:25:39] ===================== fair_doorbells ======================
[12:25:39] [PASSED] 1 VF
[12:25:39] [PASSED] 2 VFs
[12:25:39] [PASSED] 3 VFs
[12:25:39] [PASSED] 4 VFs
[12:25:39] [PASSED] 5 VFs
[12:25:39] [PASSED] 6 VFs
[12:25:39] [PASSED] 7 VFs
[12:25:39] [PASSED] 8 VFs
[12:25:39] [PASSED] 9 VFs
[12:25:39] [PASSED] 10 VFs
[12:25:39] [PASSED] 11 VFs
[12:25:39] [PASSED] 12 VFs
[12:25:39] [PASSED] 13 VFs
[12:25:39] [PASSED] 14 VFs
[12:25:39] [PASSED] 15 VFs
[12:25:39] [PASSED] 16 VFs
[12:25:39] [PASSED] 17 VFs
[12:25:39] [PASSED] 18 VFs
[12:25:39] [PASSED] 19 VFs
[12:25:39] [PASSED] 20 VFs
[12:25:39] [PASSED] 21 VFs
[12:25:39] [PASSED] 22 VFs
[12:25:39] [PASSED] 23 VFs
[12:25:39] [PASSED] 24 VFs
[12:25:39] [PASSED] 25 VFs
[12:25:39] [PASSED] 26 VFs
[12:25:39] [PASSED] 27 VFs
[12:25:39] [PASSED] 28 VFs
[12:25:39] [PASSED] 29 VFs
[12:25:39] [PASSED] 30 VFs
[12:25:39] [PASSED] 31 VFs
[12:25:39] [PASSED] 32 VFs
[12:25:39] [PASSED] 33 VFs
[12:25:39] [PASSED] 34 VFs
[12:25:39] [PASSED] 35 VFs
[12:25:39] [PASSED] 36 VFs
[12:25:39] [PASSED] 37 VFs
[12:25:39] [PASSED] 38 VFs
[12:25:39] [PASSED] 39 VFs
[12:25:39] [PASSED] 40 VFs
[12:25:39] [PASSED] 41 VFs
[12:25:39] [PASSED] 42 VFs
[12:25:39] [PASSED] 43 VFs
[12:25:39] [PASSED] 44 VFs
[12:25:39] [PASSED] 45 VFs
[12:25:39] [PASSED] 46 VFs
[12:25:39] [PASSED] 47 VFs
[12:25:39] [PASSED] 48 VFs
[12:25:39] [PASSED] 49 VFs
[12:25:39] [PASSED] 50 VFs
[12:25:39] [PASSED] 51 VFs
[12:25:39] [PASSED] 52 VFs
[12:25:39] [PASSED] 53 VFs
[12:25:39] [PASSED] 54 VFs
[12:25:39] [PASSED] 55 VFs
[12:25:39] [PASSED] 56 VFs
[12:25:39] [PASSED] 57 VFs
[12:25:39] [PASSED] 58 VFs
[12:25:39] [PASSED] 59 VFs
[12:25:39] [PASSED] 60 VFs
[12:25:39] [PASSED] 61 VFs
[12:25:39] [PASSED] 62 VFs
[12:25:39] [PASSED] 63 VFs
[12:25:39] ================= [PASSED] fair_doorbells ==================
[12:25:39] ======================== fair_ggtt ========================
[12:25:39] [PASSED] 1 VF
[12:25:39] [PASSED] 2 VFs
[12:25:39] [PASSED] 3 VFs
[12:25:39] [PASSED] 4 VFs
[12:25:39] [PASSED] 5 VFs
[12:25:39] [PASSED] 6 VFs
[12:25:39] [PASSED] 7 VFs
[12:25:39] [PASSED] 8 VFs
[12:25:39] [PASSED] 9 VFs
[12:25:39] [PASSED] 10 VFs
[12:25:39] [PASSED] 11 VFs
[12:25:39] [PASSED] 12 VFs
[12:25:39] [PASSED] 13 VFs
[12:25:39] [PASSED] 14 VFs
[12:25:39] [PASSED] 15 VFs
[12:25:39] [PASSED] 16 VFs
[12:25:39] [PASSED] 17 VFs
[12:25:39] [PASSED] 18 VFs
[12:25:39] [PASSED] 19 VFs
[12:25:39] [PASSED] 20 VFs
[12:25:39] [PASSED] 21 VFs
[12:25:39] [PASSED] 22 VFs
[12:25:39] [PASSED] 23 VFs
[12:25:39] [PASSED] 24 VFs
[12:25:39] [PASSED] 25 VFs
[12:25:39] [PASSED] 26 VFs
[12:25:39] [PASSED] 27 VFs
[12:25:39] [PASSED] 28 VFs
[12:25:39] [PASSED] 29 VFs
[12:25:39] [PASSED] 30 VFs
[12:25:39] [PASSED] 31 VFs
[12:25:39] [PASSED] 32 VFs
[12:25:39] [PASSED] 33 VFs
[12:25:39] [PASSED] 34 VFs
[12:25:39] [PASSED] 35 VFs
[12:25:39] [PASSED] 36 VFs
[12:25:39] [PASSED] 37 VFs
[12:25:39] [PASSED] 38 VFs
[12:25:39] [PASSED] 39 VFs
[12:25:39] [PASSED] 40 VFs
[12:25:39] [PASSED] 41 VFs
[12:25:39] [PASSED] 42 VFs
[12:25:39] [PASSED] 43 VFs
[12:25:39] [PASSED] 44 VFs
[12:25:39] [PASSED] 45 VFs
[12:25:39] [PASSED] 46 VFs
[12:25:39] [PASSED] 47 VFs
[12:25:39] [PASSED] 48 VFs
[12:25:39] [PASSED] 49 VFs
[12:25:39] [PASSED] 50 VFs
[12:25:39] [PASSED] 51 VFs
[12:25:39] [PASSED] 52 VFs
[12:25:39] [PASSED] 53 VFs
[12:25:39] [PASSED] 54 VFs
[12:25:39] [PASSED] 55 VFs
[12:25:39] [PASSED] 56 VFs
[12:25:39] [PASSED] 57 VFs
[12:25:39] [PASSED] 58 VFs
[12:25:39] [PASSED] 59 VFs
[12:25:39] [PASSED] 60 VFs
[12:25:39] [PASSED] 61 VFs
[12:25:39] [PASSED] 62 VFs
[12:25:39] [PASSED] 63 VFs
[12:25:39] ==================== [PASSED] fair_ggtt ====================
[12:25:39] ======================== fair_vram ========================
[12:25:39] [PASSED] 1 VF
[12:25:39] [PASSED] 2 VFs
[12:25:39] [PASSED] 3 VFs
[12:25:39] [PASSED] 4 VFs
[12:25:39] [PASSED] 5 VFs
[12:25:39] [PASSED] 6 VFs
[12:25:39] [PASSED] 7 VFs
[12:25:39] [PASSED] 8 VFs
[12:25:39] [PASSED] 9 VFs
[12:25:39] [PASSED] 10 VFs
[12:25:39] [PASSED] 11 VFs
[12:25:39] [PASSED] 12 VFs
[12:25:39] [PASSED] 13 VFs
[12:25:39] [PASSED] 14 VFs
[12:25:39] [PASSED] 15 VFs
[12:25:39] [PASSED] 16 VFs
[12:25:39] [PASSED] 17 VFs
[12:25:39] [PASSED] 18 VFs
[12:25:39] [PASSED] 19 VFs
[12:25:39] [PASSED] 20 VFs
[12:25:39] [PASSED] 21 VFs
[12:25:39] [PASSED] 22 VFs
[12:25:39] [PASSED] 23 VFs
[12:25:39] [PASSED] 24 VFs
[12:25:39] [PASSED] 25 VFs
[12:25:39] [PASSED] 26 VFs
[12:25:39] [PASSED] 27 VFs
[12:25:39] [PASSED] 28 VFs
[12:25:39] [PASSED] 29 VFs
[12:25:39] [PASSED] 30 VFs
[12:25:39] [PASSED] 31 VFs
[12:25:39] [PASSED] 32 VFs
[12:25:39] [PASSED] 33 VFs
[12:25:39] [PASSED] 34 VFs
[12:25:39] [PASSED] 35 VFs
[12:25:39] [PASSED] 36 VFs
[12:25:39] [PASSED] 37 VFs
[12:25:39] [PASSED] 38 VFs
[12:25:39] [PASSED] 39 VFs
[12:25:39] [PASSED] 40 VFs
[12:25:39] [PASSED] 41 VFs
[12:25:39] [PASSED] 42 VFs
[12:25:39] [PASSED] 43 VFs
[12:25:39] [PASSED] 44 VFs
[12:25:39] [PASSED] 45 VFs
[12:25:39] [PASSED] 46 VFs
[12:25:39] [PASSED] 47 VFs
[12:25:39] [PASSED] 48 VFs
[12:25:39] [PASSED] 49 VFs
[12:25:39] [PASSED] 50 VFs
[12:25:39] [PASSED] 51 VFs
[12:25:39] [PASSED] 52 VFs
[12:25:39] [PASSED] 53 VFs
[12:25:39] [PASSED] 54 VFs
[12:25:39] [PASSED] 55 VFs
[12:25:39] [PASSED] 56 VFs
[12:25:39] [PASSED] 57 VFs
[12:25:39] [PASSED] 58 VFs
[12:25:39] [PASSED] 59 VFs
[12:25:39] [PASSED] 60 VFs
[12:25:39] [PASSED] 61 VFs
[12:25:39] [PASSED] 62 VFs
[12:25:39] [PASSED] 63 VFs
[12:25:39] ==================== [PASSED] fair_vram ====================
[12:25:39] ================== [PASSED] pf_gt_config ===================
[12:25:39] ===================== lmtt (1 subtest) =====================
[12:25:39] ======================== test_ops =========================
[12:25:39] [PASSED] 2-level
[12:25:39] [PASSED] multi-level
[12:25:39] ==================== [PASSED] test_ops =====================
[12:25:39] ====================== [PASSED] lmtt =======================
[12:25:39] ================= pf_service (11 subtests) =================
[12:25:39] [PASSED] pf_negotiate_any
[12:25:39] [PASSED] pf_negotiate_base_match
[12:25:39] [PASSED] pf_negotiate_base_newer
[12:25:39] [PASSED] pf_negotiate_base_next
[12:25:39] [SKIPPED] pf_negotiate_base_older
[12:25:39] [PASSED] pf_negotiate_base_prev
[12:25:39] [PASSED] pf_negotiate_latest_match
[12:25:39] [PASSED] pf_negotiate_latest_newer
[12:25:39] [PASSED] pf_negotiate_latest_next
[12:25:39] [SKIPPED] pf_negotiate_latest_older
[12:25:39] [SKIPPED] pf_negotiate_latest_prev
[12:25:39] =================== [PASSED] pf_service ====================
[12:25:39] ================= xe_guc_g2g (2 subtests) ==================
[12:25:39] ============== xe_live_guc_g2g_kunit_default ==============
[12:25:39] ========= [SKIPPED] xe_live_guc_g2g_kunit_default ==========
[12:25:39] ============== xe_live_guc_g2g_kunit_allmem ===============
[12:25:39] ========== [SKIPPED] xe_live_guc_g2g_kunit_allmem ==========
[12:25:39] =================== [SKIPPED] xe_guc_g2g ===================
[12:25:39] =================== xe_mocs (2 subtests) ===================
[12:25:39] ================ xe_live_mocs_kernel_kunit ================
[12:25:39] =========== [SKIPPED] xe_live_mocs_kernel_kunit ============
[12:25:39] ================ xe_live_mocs_reset_kunit =================
[12:25:39] ============ [SKIPPED] xe_live_mocs_reset_kunit ============
[12:25:39] ==================== [SKIPPED] xe_mocs =====================
[12:25:39] ================= xe_migrate (2 subtests) ==================
[12:25:39] ================= xe_migrate_sanity_kunit =================
[12:25:39] ============ [SKIPPED] xe_migrate_sanity_kunit =============
[12:25:39] ================== xe_validate_ccs_kunit ==================
[12:25:39] ============= [SKIPPED] xe_validate_ccs_kunit ==============
[12:25:39] =================== [SKIPPED] xe_migrate ===================
[12:25:39] ================== xe_dma_buf (1 subtest) ==================
[12:25:39] ==================== xe_dma_buf_kunit =====================
[12:25:39] ================ [SKIPPED] xe_dma_buf_kunit ================
[12:25:39] =================== [SKIPPED] xe_dma_buf ===================
[12:25:39] ================= xe_bo_shrink (1 subtest) =================
[12:25:39] =================== xe_bo_shrink_kunit ====================
[12:25:39] =============== [SKIPPED] xe_bo_shrink_kunit ===============
[12:25:39] ================== [SKIPPED] xe_bo_shrink ==================
[12:25:39] ==================== xe_bo (2 subtests) ====================
[12:25:39] ================== xe_ccs_migrate_kunit ===================
[12:25:39] ============== [SKIPPED] xe_ccs_migrate_kunit ==============
[12:25:39] ==================== xe_bo_evict_kunit ====================
[12:25:39] =============== [SKIPPED] xe_bo_evict_kunit ================
[12:25:39] ===================== [SKIPPED] xe_bo ======================
[12:25:39] ==================== args (13 subtests) ====================
[12:25:39] [PASSED] count_args_test
[12:25:39] [PASSED] call_args_example
[12:25:39] [PASSED] call_args_test
[12:25:39] [PASSED] drop_first_arg_example
[12:25:39] [PASSED] drop_first_arg_test
[12:25:39] [PASSED] first_arg_example
[12:25:39] [PASSED] first_arg_test
[12:25:39] [PASSED] last_arg_example
[12:25:39] [PASSED] last_arg_test
[12:25:39] [PASSED] pick_arg_example
[12:25:39] [PASSED] if_args_example
[12:25:39] [PASSED] if_args_test
[12:25:39] [PASSED] sep_comma_example
[12:25:39] ====================== [PASSED] args =======================
[12:25:39] =================== xe_pci (3 subtests) ====================
[12:25:39] ==================== check_graphics_ip ====================
[12:25:39] [PASSED] 12.00 Xe_LP
[12:25:39] [PASSED] 12.10 Xe_LP+
[12:25:39] [PASSED] 12.55 Xe_HPG
[12:25:39] [PASSED] 12.60 Xe_HPC
[12:25:39] [PASSED] 12.70 Xe_LPG
[12:25:39] [PASSED] 12.71 Xe_LPG
[12:25:39] [PASSED] 12.74 Xe_LPG+
[12:25:39] [PASSED] 20.01 Xe2_HPG
[12:25:39] [PASSED] 20.02 Xe2_HPG
[12:25:39] [PASSED] 20.04 Xe2_LPG
[12:25:39] [PASSED] 30.00 Xe3_LPG
[12:25:39] [PASSED] 30.01 Xe3_LPG
[12:25:39] [PASSED] 30.03 Xe3_LPG
[12:25:39] [PASSED] 30.04 Xe3_LPG
[12:25:39] [PASSED] 30.05 Xe3_LPG
[12:25:39] [PASSED] 35.10 Xe3p_LPG
[12:25:39] [PASSED] 35.11 Xe3p_XPC
[12:25:39] ================ [PASSED] check_graphics_ip ================
[12:25:39] ===================== check_media_ip ======================
[12:25:39] [PASSED] 12.00 Xe_M
[12:25:39] [PASSED] 12.55 Xe_HPM
[12:25:39] [PASSED] 13.00 Xe_LPM+
[12:25:39] [PASSED] 13.01 Xe2_HPM
[12:25:39] [PASSED] 20.00 Xe2_LPM
[12:25:39] [PASSED] 30.00 Xe3_LPM
[12:25:39] [PASSED] 30.02 Xe3_LPM
[12:25:39] [PASSED] 35.00 Xe3p_LPM
[12:25:39] [PASSED] 35.03 Xe3p_HPM
[12:25:39] ================= [PASSED] check_media_ip ==================
[12:25:39] =================== check_platform_desc ===================
[12:25:39] [PASSED] 0x9A60 (TIGERLAKE)
[12:25:39] [PASSED] 0x9A68 (TIGERLAKE)
[12:25:39] [PASSED] 0x9A70 (TIGERLAKE)
[12:25:39] [PASSED] 0x9A40 (TIGERLAKE)
[12:25:39] [PASSED] 0x9A49 (TIGERLAKE)
[12:25:39] [PASSED] 0x9A59 (TIGERLAKE)
[12:25:39] [PASSED] 0x9A78 (TIGERLAKE)
[12:25:39] [PASSED] 0x9AC0 (TIGERLAKE)
[12:25:39] [PASSED] 0x9AC9 (TIGERLAKE)
[12:25:39] [PASSED] 0x9AD9 (TIGERLAKE)
[12:25:39] [PASSED] 0x9AF8 (TIGERLAKE)
[12:25:39] [PASSED] 0x4C80 (ROCKETLAKE)
[12:25:39] [PASSED] 0x4C8A (ROCKETLAKE)
[12:25:39] [PASSED] 0x4C8B (ROCKETLAKE)
[12:25:39] [PASSED] 0x4C8C (ROCKETLAKE)
[12:25:39] [PASSED] 0x4C90 (ROCKETLAKE)
[12:25:39] [PASSED] 0x4C9A (ROCKETLAKE)
[12:25:39] [PASSED] 0x4680 (ALDERLAKE_S)
[12:25:39] [PASSED] 0x4682 (ALDERLAKE_S)
[12:25:39] [PASSED] 0x4688 (ALDERLAKE_S)
[12:25:39] [PASSED] 0x468A (ALDERLAKE_S)
[12:25:39] [PASSED] 0x468B (ALDERLAKE_S)
[12:25:39] [PASSED] 0x4690 (ALDERLAKE_S)
[12:25:39] [PASSED] 0x4692 (ALDERLAKE_S)
[12:25:39] [PASSED] 0x4693 (ALDERLAKE_S)
[12:25:39] [PASSED] 0x46A0 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46A1 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46A2 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46A3 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46A6 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46A8 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46AA (ALDERLAKE_P)
[12:25:39] [PASSED] 0x462A (ALDERLAKE_P)
[12:25:39] [PASSED] 0x4626 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x4628 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46B0 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46B1 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46B2 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46B3 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46C0 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46C1 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46C2 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46C3 (ALDERLAKE_P)
[12:25:39] [PASSED] 0x46D0 (ALDERLAKE_N)
[12:25:39] [PASSED] 0x46D1 (ALDERLAKE_N)
[12:25:39] [PASSED] 0x46D2 (ALDERLAKE_N)
[12:25:39] [PASSED] 0x46D3 (ALDERLAKE_N)
[12:25:39] [PASSED] 0x46D4 (ALDERLAKE_N)
[12:25:39] [PASSED] 0xA721 (ALDERLAKE_P)
[12:25:39] [PASSED] 0xA7A1 (ALDERLAKE_P)
[12:25:39] [PASSED] 0xA7A9 (ALDERLAKE_P)
[12:25:39] [PASSED] 0xA7AC (ALDERLAKE_P)
[12:25:39] [PASSED] 0xA7AD (ALDERLAKE_P)
[12:25:39] [PASSED] 0xA720 (ALDERLAKE_P)
[12:25:39] [PASSED] 0xA7A0 (ALDERLAKE_P)
[12:25:39] [PASSED] 0xA7A8 (ALDERLAKE_P)
[12:25:39] [PASSED] 0xA7AA (ALDERLAKE_P)
[12:25:39] [PASSED] 0xA7AB (ALDERLAKE_P)
[12:25:39] [PASSED] 0xA780 (ALDERLAKE_S)
[12:25:39] [PASSED] 0xA781 (ALDERLAKE_S)
[12:25:39] [PASSED] 0xA782 (ALDERLAKE_S)
[12:25:39] [PASSED] 0xA783 (ALDERLAKE_S)
[12:25:39] [PASSED] 0xA788 (ALDERLAKE_S)
[12:25:39] [PASSED] 0xA789 (ALDERLAKE_S)
[12:25:39] [PASSED] 0xA78A (ALDERLAKE_S)
[12:25:39] [PASSED] 0xA78B (ALDERLAKE_S)
[12:25:39] [PASSED] 0x4905 (DG1)
[12:25:39] [PASSED] 0x4906 (DG1)
[12:25:39] [PASSED] 0x4907 (DG1)
[12:25:39] [PASSED] 0x4908 (DG1)
[12:25:39] [PASSED] 0x4909 (DG1)
[12:25:39] [PASSED] 0x56C0 (DG2)
[12:25:39] [PASSED] 0x56C2 (DG2)
[12:25:39] [PASSED] 0x56C1 (DG2)
[12:25:39] [PASSED] 0x7D51 (METEORLAKE)
[12:25:39] [PASSED] 0x7DD1 (METEORLAKE)
[12:25:39] [PASSED] 0x7D41 (METEORLAKE)
[12:25:39] [PASSED] 0x7D67 (METEORLAKE)
[12:25:39] [PASSED] 0xB640 (METEORLAKE)
[12:25:39] [PASSED] 0x56A0 (DG2)
[12:25:39] [PASSED] 0x56A1 (DG2)
[12:25:39] [PASSED] 0x56A2 (DG2)
[12:25:39] [PASSED] 0x56BE (DG2)
[12:25:39] [PASSED] 0x56BF (DG2)
[12:25:39] [PASSED] 0x5690 (DG2)
[12:25:39] [PASSED] 0x5691 (DG2)
[12:25:39] [PASSED] 0x5692 (DG2)
[12:25:39] [PASSED] 0x56A5 (DG2)
[12:25:39] [PASSED] 0x56A6 (DG2)
[12:25:39] [PASSED] 0x56B0 (DG2)
[12:25:39] [PASSED] 0x56B1 (DG2)
[12:25:39] [PASSED] 0x56BA (DG2)
[12:25:39] [PASSED] 0x56BB (DG2)
[12:25:39] [PASSED] 0x56BC (DG2)
[12:25:39] [PASSED] 0x56BD (DG2)
[12:25:39] [PASSED] 0x5693 (DG2)
[12:25:39] [PASSED] 0x5694 (DG2)
[12:25:39] [PASSED] 0x5695 (DG2)
[12:25:39] [PASSED] 0x56A3 (DG2)
[12:25:39] [PASSED] 0x56A4 (DG2)
[12:25:39] [PASSED] 0x56B2 (DG2)
[12:25:39] [PASSED] 0x56B3 (DG2)
[12:25:39] [PASSED] 0x5696 (DG2)
[12:25:39] [PASSED] 0x5697 (DG2)
[12:25:39] [PASSED] 0xB69 (PVC)
[12:25:39] [PASSED] 0xB6E (PVC)
[12:25:39] [PASSED] 0xBD4 (PVC)
[12:25:39] [PASSED] 0xBD5 (PVC)
[12:25:39] [PASSED] 0xBD6 (PVC)
[12:25:39] [PASSED] 0xBD7 (PVC)
[12:25:39] [PASSED] 0xBD8 (PVC)
[12:25:39] [PASSED] 0xBD9 (PVC)
[12:25:39] [PASSED] 0xBDA (PVC)
[12:25:39] [PASSED] 0xBDB (PVC)
[12:25:39] [PASSED] 0xBE0 (PVC)
[12:25:39] [PASSED] 0xBE1 (PVC)
[12:25:39] [PASSED] 0xBE5 (PVC)
[12:25:39] [PASSED] 0x7D40 (METEORLAKE)
[12:25:39] [PASSED] 0x7D45 (METEORLAKE)
[12:25:39] [PASSED] 0x7D55 (METEORLAKE)
[12:25:39] [PASSED] 0x7D60 (METEORLAKE)
[12:25:39] [PASSED] 0x7DD5 (METEORLAKE)
[12:25:39] [PASSED] 0x6420 (LUNARLAKE)
[12:25:39] [PASSED] 0x64A0 (LUNARLAKE)
[12:25:39] [PASSED] 0x64B0 (LUNARLAKE)
[12:25:39] [PASSED] 0xE202 (BATTLEMAGE)
[12:25:39] [PASSED] 0xE209 (BATTLEMAGE)
[12:25:39] [PASSED] 0xE20B (BATTLEMAGE)
[12:25:39] [PASSED] 0xE20C (BATTLEMAGE)
[12:25:39] [PASSED] 0xE20D (BATTLEMAGE)
[12:25:39] [PASSED] 0xE210 (BATTLEMAGE)
[12:25:39] [PASSED] 0xE211 (BATTLEMAGE)
[12:25:39] [PASSED] 0xE212 (BATTLEMAGE)
[12:25:39] [PASSED] 0xE216 (BATTLEMAGE)
[12:25:39] [PASSED] 0xE220 (BATTLEMAGE)
[12:25:39] [PASSED] 0xE221 (BATTLEMAGE)
[12:25:39] [PASSED] 0xE222 (BATTLEMAGE)
[12:25:39] [PASSED] 0xE223 (BATTLEMAGE)
[12:25:39] [PASSED] 0xB080 (PANTHERLAKE)
[12:25:39] [PASSED] 0xB081 (PANTHERLAKE)
[12:25:39] [PASSED] 0xB082 (PANTHERLAKE)
[12:25:39] [PASSED] 0xB083 (PANTHERLAKE)
[12:25:39] [PASSED] 0xB084 (PANTHERLAKE)
[12:25:39] [PASSED] 0xB085 (PANTHERLAKE)
[12:25:39] [PASSED] 0xB086 (PANTHERLAKE)
[12:25:39] [PASSED] 0xB087 (PANTHERLAKE)
[12:25:39] [PASSED] 0xB08F (PANTHERLAKE)
[12:25:39] [PASSED] 0xB090 (PANTHERLAKE)
[12:25:39] [PASSED] 0xB0A0 (PANTHERLAKE)
[12:25:39] [PASSED] 0xB0B0 (PANTHERLAKE)
[12:25:39] [PASSED] 0xFD80 (PANTHERLAKE)
[12:25:39] [PASSED] 0xFD81 (PANTHERLAKE)
[12:25:39] [PASSED] 0xD740 (NOVALAKE_S)
[12:25:39] [PASSED] 0xD741 (NOVALAKE_S)
[12:25:39] [PASSED] 0xD742 (NOVALAKE_S)
[12:25:39] [PASSED] 0xD743 (NOVALAKE_S)
[12:25:39] [PASSED] 0xD744 (NOVALAKE_S)
[12:25:39] [PASSED] 0xD745 (NOVALAKE_S)
[12:25:39] [PASSED] 0x674C (CRESCENTISLAND)
[12:25:39] [PASSED] 0xD750 (NOVALAKE_P)
[12:25:39] [PASSED] 0xD751 (NOVALAKE_P)
[12:25:39] [PASSED] 0xD752 (NOVALAKE_P)
[12:25:39] [PASSED] 0xD753 (NOVALAKE_P)
[12:25:39] [PASSED] 0xD754 (NOVALAKE_P)
[12:25:39] [PASSED] 0xD755 (NOVALAKE_P)
[12:25:39] [PASSED] 0xD756 (NOVALAKE_P)
[12:25:39] [PASSED] 0xD757 (NOVALAKE_P)
[12:25:39] [PASSED] 0xD75F (NOVALAKE_P)
[12:25:39] =============== [PASSED] check_platform_desc ===============
[12:25:39] ===================== [PASSED] xe_pci ======================
[12:25:39] =================== xe_rtp (2 subtests) ====================
[12:25:39] =============== xe_rtp_process_to_sr_tests ================
[12:25:39] [PASSED] coalesce-same-reg
[12:25:39] [PASSED] no-match-no-add
[12:25:39] [PASSED] match-or
[12:25:39] [PASSED] match-or-xfail
[12:25:39] [PASSED] no-match-no-add-multiple-rules
[12:25:39] [PASSED] two-regs-two-entries
[12:25:39] [PASSED] clr-one-set-other
[12:25:39] [PASSED] set-field
[12:25:39] [PASSED] conflict-duplicate
[12:25:39] [PASSED] conflict-not-disjoint
[12:25:39] [PASSED] conflict-reg-type
[12:25:39] =========== [PASSED] xe_rtp_process_to_sr_tests ============
[12:25:39] ================== xe_rtp_process_tests ===================
[12:25:39] [PASSED] active1
[12:25:39] [PASSED] active2
[12:25:39] [PASSED] active-inactive
[12:25:39] [PASSED] inactive-active
[12:25:39] [PASSED] inactive-1st_or_active-inactive
[12:25:39] [PASSED] inactive-2nd_or_active-inactive
[12:25:39] [PASSED] inactive-last_or_active-inactive
[12:25:39] [PASSED] inactive-no_or_active-inactive
[12:25:39] ============== [PASSED] xe_rtp_process_tests ===============
[12:25:39] ===================== [PASSED] xe_rtp ======================
[12:25:39] ==================== xe_wa (1 subtest) =====================
[12:25:39] ======================== xe_wa_gt =========================
[12:25:39] [PASSED] TIGERLAKE B0
[12:25:39] [PASSED] DG1 A0
[12:25:39] [PASSED] DG1 B0
[12:25:39] [PASSED] ALDERLAKE_S A0
[12:25:39] [PASSED] ALDERLAKE_S B0
[12:25:39] [PASSED] ALDERLAKE_S C0
[12:25:39] [PASSED] ALDERLAKE_S D0
[12:25:39] [PASSED] ALDERLAKE_P A0
[12:25:39] [PASSED] ALDERLAKE_P B0
[12:25:39] [PASSED] ALDERLAKE_P C0
[12:25:39] [PASSED] ALDERLAKE_S RPLS D0
[12:25:39] [PASSED] ALDERLAKE_P RPLU E0
[12:25:39] [PASSED] DG2 G10 C0
[12:25:39] [PASSED] DG2 G11 B1
[12:25:39] [PASSED] DG2 G12 A1
[12:25:39] [PASSED] METEORLAKE 12.70(Xe_LPG) A0 13.00(Xe_LPM+) A0
[12:25:39] [PASSED] METEORLAKE 12.71(Xe_LPG) A0 13.00(Xe_LPM+) A0
[12:25:39] [PASSED] METEORLAKE 12.74(Xe_LPG+) A0 13.00(Xe_LPM+) A0
[12:25:39] [PASSED] LUNARLAKE 20.04(Xe2_LPG) A0 20.00(Xe2_LPM) A0
[12:25:39] [PASSED] LUNARLAKE 20.04(Xe2_LPG) B0 20.00(Xe2_LPM) A0
[12:25:39] [PASSED] BATTLEMAGE 20.01(Xe2_HPG) A0 13.01(Xe2_HPM) A1
[12:25:39] [PASSED] PANTHERLAKE 30.00(Xe3_LPG) A0 30.00(Xe3_LPM) A0
[12:25:39] ==================== [PASSED] xe_wa_gt =====================
[12:25:39] ====================== [PASSED] xe_wa ======================
[12:25:39] ============================================================
[12:25:39] Testing complete. Ran 597 tests: passed: 579, skipped: 18
[12:25:39] Elapsed time: 48.905s total, 4.176s configuring, 44.111s building, 0.607s running
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/tests/.kunitconfig
[12:25:40] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[12:25:41] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[12:26:16] Starting KUnit Kernel (1/1)...
[12:26:16] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[12:26:16] ============ drm_test_pick_cmdline (2 subtests) ============
[12:26:16] [PASSED] drm_test_pick_cmdline_res_1920_1080_60
[12:26:16] =============== drm_test_pick_cmdline_named ===============
[12:26:16] [PASSED] NTSC
[12:26:16] [PASSED] NTSC-J
[12:26:16] [PASSED] PAL
[12:26:16] [PASSED] PAL-M
[12:26:16] =========== [PASSED] drm_test_pick_cmdline_named ===========
[12:26:16] ============== [PASSED] drm_test_pick_cmdline ==============
[12:26:16] == drm_test_atomic_get_connector_for_encoder (1 subtest) ===
[12:26:16] [PASSED] drm_test_drm_atomic_get_connector_for_encoder
[12:26:16] ==== [PASSED] drm_test_atomic_get_connector_for_encoder ====
[12:26:16] =========== drm_validate_clone_mode (2 subtests) ===========
[12:26:16] ============== drm_test_check_in_clone_mode ===============
[12:26:16] [PASSED] in_clone_mode
[12:26:16] [PASSED] not_in_clone_mode
[12:26:16] ========== [PASSED] drm_test_check_in_clone_mode ===========
[12:26:16] =============== drm_test_check_valid_clones ===============
[12:26:16] [PASSED] not_in_clone_mode
[12:26:16] [PASSED] valid_clone
[12:26:16] [PASSED] invalid_clone
[12:26:16] =========== [PASSED] drm_test_check_valid_clones ===========
[12:26:16] ============= [PASSED] drm_validate_clone_mode =============
[12:26:16] ============= drm_validate_modeset (1 subtest) =============
[12:26:16] [PASSED] drm_test_check_connector_changed_modeset
[12:26:16] ============== [PASSED] drm_validate_modeset ===============
[12:26:16] ====== drm_test_bridge_get_current_state (2 subtests) ======
[12:26:16] [PASSED] drm_test_drm_bridge_get_current_state_atomic
[12:26:16] [PASSED] drm_test_drm_bridge_get_current_state_legacy
[12:26:16] ======== [PASSED] drm_test_bridge_get_current_state ========
[12:26:16] ====== drm_test_bridge_helper_reset_crtc (3 subtests) ======
[12:26:16] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic
[12:26:16] [PASSED] drm_test_drm_bridge_helper_reset_crtc_atomic_disabled
[12:26:16] [PASSED] drm_test_drm_bridge_helper_reset_crtc_legacy
[12:26:16] ======== [PASSED] drm_test_bridge_helper_reset_crtc ========
[12:26:16] ============== drm_bridge_alloc (2 subtests) ===============
[12:26:16] [PASSED] drm_test_drm_bridge_alloc_basic
[12:26:16] [PASSED] drm_test_drm_bridge_alloc_get_put
[12:26:16] ================ [PASSED] drm_bridge_alloc =================
[12:26:16] ============= drm_cmdline_parser (40 subtests) =============
[12:26:16] [PASSED] drm_test_cmdline_force_d_only
[12:26:16] [PASSED] drm_test_cmdline_force_D_only_dvi
[12:26:16] [PASSED] drm_test_cmdline_force_D_only_hdmi
[12:26:16] [PASSED] drm_test_cmdline_force_D_only_not_digital
[12:26:16] [PASSED] drm_test_cmdline_force_e_only
[12:26:16] [PASSED] drm_test_cmdline_res
[12:26:16] [PASSED] drm_test_cmdline_res_vesa
[12:26:16] [PASSED] drm_test_cmdline_res_vesa_rblank
[12:26:16] [PASSED] drm_test_cmdline_res_rblank
[12:26:16] [PASSED] drm_test_cmdline_res_bpp
[12:26:16] [PASSED] drm_test_cmdline_res_refresh
[12:26:16] [PASSED] drm_test_cmdline_res_bpp_refresh
[12:26:16] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced
[12:26:16] [PASSED] drm_test_cmdline_res_bpp_refresh_margins
[12:26:16] [PASSED] drm_test_cmdline_res_bpp_refresh_force_off
[12:26:16] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on
[12:26:16] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_analog
[12:26:16] [PASSED] drm_test_cmdline_res_bpp_refresh_force_on_digital
[12:26:16] [PASSED] drm_test_cmdline_res_bpp_refresh_interlaced_margins_force_on
[12:26:16] [PASSED] drm_test_cmdline_res_margins_force_on
[12:26:16] [PASSED] drm_test_cmdline_res_vesa_margins
[12:26:16] [PASSED] drm_test_cmdline_name
[12:26:16] [PASSED] drm_test_cmdline_name_bpp
[12:26:16] [PASSED] drm_test_cmdline_name_option
[12:26:16] [PASSED] drm_test_cmdline_name_bpp_option
[12:26:16] [PASSED] drm_test_cmdline_rotate_0
[12:26:16] [PASSED] drm_test_cmdline_rotate_90
[12:26:16] [PASSED] drm_test_cmdline_rotate_180
[12:26:16] [PASSED] drm_test_cmdline_rotate_270
[12:26:16] [PASSED] drm_test_cmdline_hmirror
[12:26:16] [PASSED] drm_test_cmdline_vmirror
[12:26:16] [PASSED] drm_test_cmdline_margin_options
[12:26:16] [PASSED] drm_test_cmdline_multiple_options
[12:26:16] [PASSED] drm_test_cmdline_bpp_extra_and_option
[12:26:16] [PASSED] drm_test_cmdline_extra_and_option
[12:26:16] [PASSED] drm_test_cmdline_freestanding_options
[12:26:16] [PASSED] drm_test_cmdline_freestanding_force_e_and_options
[12:26:16] [PASSED] drm_test_cmdline_panel_orientation
[12:26:16] ================ drm_test_cmdline_invalid =================
[12:26:16] [PASSED] margin_only
[12:26:16] [PASSED] interlace_only
[12:26:16] [PASSED] res_missing_x
[12:26:16] [PASSED] res_missing_y
[12:26:16] [PASSED] res_bad_y
[12:26:16] [PASSED] res_missing_y_bpp
[12:26:16] [PASSED] res_bad_bpp
[12:26:16] [PASSED] res_bad_refresh
[12:26:16] [PASSED] res_bpp_refresh_force_on_off
[12:26:16] [PASSED] res_invalid_mode
[12:26:16] [PASSED] res_bpp_wrong_place_mode
[12:26:16] [PASSED] name_bpp_refresh
[12:26:16] [PASSED] name_refresh
[12:26:16] [PASSED] name_refresh_wrong_mode
[12:26:16] [PASSED] name_refresh_invalid_mode
[12:26:16] [PASSED] rotate_multiple
[12:26:16] [PASSED] rotate_invalid_val
[12:26:16] [PASSED] rotate_truncated
[12:26:16] [PASSED] invalid_option
[12:26:16] [PASSED] invalid_tv_option
[12:26:16] [PASSED] truncated_tv_option
[12:26:16] ============ [PASSED] drm_test_cmdline_invalid =============
[12:26:16] =============== drm_test_cmdline_tv_options ===============
[12:26:16] [PASSED] NTSC
[12:26:16] [PASSED] NTSC_443
[12:26:16] [PASSED] NTSC_J
[12:26:16] [PASSED] PAL
[12:26:16] [PASSED] PAL_M
[12:26:16] [PASSED] PAL_N
[12:26:16] [PASSED] SECAM
[12:26:16] [PASSED] MONO_525
[12:26:16] [PASSED] MONO_625
[12:26:16] =========== [PASSED] drm_test_cmdline_tv_options ===========
[12:26:16] =============== [PASSED] drm_cmdline_parser ================
[12:26:16] ========== drmm_connector_hdmi_init (20 subtests) ==========
[12:26:16] [PASSED] drm_test_connector_hdmi_init_valid
[12:26:16] [PASSED] drm_test_connector_hdmi_init_bpc_8
[12:26:16] [PASSED] drm_test_connector_hdmi_init_bpc_10
[12:26:16] [PASSED] drm_test_connector_hdmi_init_bpc_12
[12:26:16] [PASSED] drm_test_connector_hdmi_init_bpc_invalid
[12:26:16] [PASSED] drm_test_connector_hdmi_init_bpc_null
[12:26:16] [PASSED] drm_test_connector_hdmi_init_formats_empty
[12:26:16] [PASSED] drm_test_connector_hdmi_init_formats_no_rgb
[12:26:16] === drm_test_connector_hdmi_init_formats_yuv420_allowed ===
[12:26:16] [PASSED] supported_formats=0x9 yuv420_allowed=1
[12:26:16] [PASSED] supported_formats=0x9 yuv420_allowed=0
[12:26:16] [PASSED] supported_formats=0x5 yuv420_allowed=1
[12:26:16] [PASSED] supported_formats=0x5 yuv420_allowed=0
[12:26:16] === [PASSED] drm_test_connector_hdmi_init_formats_yuv420_allowed ===
[12:26:16] [PASSED] drm_test_connector_hdmi_init_null_ddc
[12:26:16] [PASSED] drm_test_connector_hdmi_init_null_product
[12:26:16] [PASSED] drm_test_connector_hdmi_init_null_vendor
[12:26:16] [PASSED] drm_test_connector_hdmi_init_product_length_exact
[12:26:16] [PASSED] drm_test_connector_hdmi_init_product_length_too_long
[12:26:16] [PASSED] drm_test_connector_hdmi_init_product_valid
[12:26:16] [PASSED] drm_test_connector_hdmi_init_vendor_length_exact
[12:26:16] [PASSED] drm_test_connector_hdmi_init_vendor_length_too_long
[12:26:16] [PASSED] drm_test_connector_hdmi_init_vendor_valid
[12:26:16] ========= drm_test_connector_hdmi_init_type_valid =========
[12:26:16] [PASSED] HDMI-A
[12:26:16] [PASSED] HDMI-B
[12:26:16] ===== [PASSED] drm_test_connector_hdmi_init_type_valid =====
[12:26:16] ======== drm_test_connector_hdmi_init_type_invalid ========
[12:26:16] [PASSED] Unknown
[12:26:16] [PASSED] VGA
[12:26:16] [PASSED] DVI-I
[12:26:16] [PASSED] DVI-D
[12:26:16] [PASSED] DVI-A
[12:26:16] [PASSED] Composite
[12:26:16] [PASSED] SVIDEO
[12:26:16] [PASSED] LVDS
[12:26:16] [PASSED] Component
[12:26:16] [PASSED] DIN
[12:26:16] [PASSED] DP
[12:26:16] [PASSED] TV
[12:26:16] [PASSED] eDP
[12:26:16] [PASSED] Virtual
[12:26:16] [PASSED] DSI
[12:26:16] [PASSED] DPI
[12:26:16] [PASSED] Writeback
[12:26:16] [PASSED] SPI
[12:26:16] [PASSED] USB
[12:26:16] ==== [PASSED] drm_test_connector_hdmi_init_type_invalid ====
[12:26:16] ============ [PASSED] drmm_connector_hdmi_init =============
[12:26:16] ============= drmm_connector_init (3 subtests) =============
[12:26:16] [PASSED] drm_test_drmm_connector_init
[12:26:16] [PASSED] drm_test_drmm_connector_init_null_ddc
[12:26:16] ========= drm_test_drmm_connector_init_type_valid =========
[12:26:16] [PASSED] Unknown
[12:26:16] [PASSED] VGA
[12:26:16] [PASSED] DVI-I
[12:26:16] [PASSED] DVI-D
[12:26:16] [PASSED] DVI-A
[12:26:16] [PASSED] Composite
[12:26:16] [PASSED] SVIDEO
[12:26:16] [PASSED] LVDS
[12:26:16] [PASSED] Component
[12:26:16] [PASSED] DIN
[12:26:16] [PASSED] DP
[12:26:16] [PASSED] HDMI-A
[12:26:16] [PASSED] HDMI-B
[12:26:16] [PASSED] TV
[12:26:16] [PASSED] eDP
[12:26:16] [PASSED] Virtual
[12:26:16] [PASSED] DSI
[12:26:16] [PASSED] DPI
[12:26:16] [PASSED] Writeback
[12:26:16] [PASSED] SPI
[12:26:16] [PASSED] USB
[12:26:16] ===== [PASSED] drm_test_drmm_connector_init_type_valid =====
[12:26:16] =============== [PASSED] drmm_connector_init ===============
[12:26:16] ========= drm_connector_dynamic_init (6 subtests) ==========
[12:26:16] [PASSED] drm_test_drm_connector_dynamic_init
[12:26:16] [PASSED] drm_test_drm_connector_dynamic_init_null_ddc
[12:26:16] [PASSED] drm_test_drm_connector_dynamic_init_not_added
[12:26:16] [PASSED] drm_test_drm_connector_dynamic_init_properties
[12:26:16] ===== drm_test_drm_connector_dynamic_init_type_valid ======
[12:26:16] [PASSED] Unknown
[12:26:16] [PASSED] VGA
[12:26:16] [PASSED] DVI-I
[12:26:16] [PASSED] DVI-D
[12:26:16] [PASSED] DVI-A
[12:26:16] [PASSED] Composite
[12:26:16] [PASSED] SVIDEO
[12:26:16] [PASSED] LVDS
[12:26:16] [PASSED] Component
[12:26:16] [PASSED] DIN
[12:26:16] [PASSED] DP
[12:26:16] [PASSED] HDMI-A
[12:26:16] [PASSED] HDMI-B
[12:26:16] [PASSED] TV
[12:26:16] [PASSED] eDP
[12:26:16] [PASSED] Virtual
[12:26:16] [PASSED] DSI
[12:26:16] [PASSED] DPI
[12:26:16] [PASSED] Writeback
[12:26:16] [PASSED] SPI
[12:26:16] [PASSED] USB
[12:26:16] = [PASSED] drm_test_drm_connector_dynamic_init_type_valid ==
[12:26:16] ======== drm_test_drm_connector_dynamic_init_name =========
[12:26:16] [PASSED] Unknown
[12:26:16] [PASSED] VGA
[12:26:16] [PASSED] DVI-I
[12:26:16] [PASSED] DVI-D
[12:26:16] [PASSED] DVI-A
[12:26:17] [PASSED] Composite
[12:26:17] [PASSED] SVIDEO
[12:26:17] [PASSED] LVDS
[12:26:17] [PASSED] Component
[12:26:17] [PASSED] DIN
[12:26:17] [PASSED] DP
[12:26:17] [PASSED] HDMI-A
[12:26:17] [PASSED] HDMI-B
[12:26:17] [PASSED] TV
[12:26:17] [PASSED] eDP
[12:26:17] [PASSED] Virtual
[12:26:17] [PASSED] DSI
[12:26:17] [PASSED] DPI
[12:26:17] [PASSED] Writeback
[12:26:17] [PASSED] SPI
[12:26:17] [PASSED] USB
[12:26:17] ==== [PASSED] drm_test_drm_connector_dynamic_init_name =====
[12:26:17] =========== [PASSED] drm_connector_dynamic_init ============
[12:26:17] ==== drm_connector_dynamic_register_early (4 subtests) =====
[12:26:17] [PASSED] drm_test_drm_connector_dynamic_register_early_on_list
[12:26:17] [PASSED] drm_test_drm_connector_dynamic_register_early_defer
[12:26:17] [PASSED] drm_test_drm_connector_dynamic_register_early_no_init
[12:26:17] [PASSED] drm_test_drm_connector_dynamic_register_early_no_mode_object
[12:26:17] ====== [PASSED] drm_connector_dynamic_register_early =======
[12:26:17] ======= drm_connector_dynamic_register (7 subtests) ========
[12:26:17] [PASSED] drm_test_drm_connector_dynamic_register_on_list
[12:26:17] [PASSED] drm_test_drm_connector_dynamic_register_no_defer
[12:26:17] [PASSED] drm_test_drm_connector_dynamic_register_no_init
[12:26:17] [PASSED] drm_test_drm_connector_dynamic_register_mode_object
[12:26:17] [PASSED] drm_test_drm_connector_dynamic_register_sysfs
[12:26:17] [PASSED] drm_test_drm_connector_dynamic_register_sysfs_name
[12:26:17] [PASSED] drm_test_drm_connector_dynamic_register_debugfs
[12:26:17] ========= [PASSED] drm_connector_dynamic_register ==========
[12:26:17] = drm_connector_attach_broadcast_rgb_property (2 subtests) =
[12:26:17] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property
[12:26:17] [PASSED] drm_test_drm_connector_attach_broadcast_rgb_property_hdmi_connector
[12:26:17] === [PASSED] drm_connector_attach_broadcast_rgb_property ===
[12:26:17] ========== drm_get_tv_mode_from_name (2 subtests) ==========
[12:26:17] ========== drm_test_get_tv_mode_from_name_valid ===========
[12:26:17] [PASSED] NTSC
[12:26:17] [PASSED] NTSC-443
[12:26:17] [PASSED] NTSC-J
[12:26:17] [PASSED] PAL
[12:26:17] [PASSED] PAL-M
[12:26:17] [PASSED] PAL-N
[12:26:17] [PASSED] SECAM
[12:26:17] [PASSED] Mono
[12:26:17] ====== [PASSED] drm_test_get_tv_mode_from_name_valid =======
[12:26:17] [PASSED] drm_test_get_tv_mode_from_name_truncated
[12:26:17] ============ [PASSED] drm_get_tv_mode_from_name ============
[12:26:17] = drm_test_connector_hdmi_compute_mode_clock (12 subtests) =
[12:26:17] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb
[12:26:17] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc
[12:26:17] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_10bpc_vic_1
[12:26:17] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc
[12:26:17] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_12bpc_vic_1
[12:26:17] [PASSED] drm_test_drm_hdmi_compute_mode_clock_rgb_double
[12:26:17] = drm_test_connector_hdmi_compute_mode_clock_yuv420_valid =
[12:26:17] [PASSED] VIC 96
[12:26:17] [PASSED] VIC 97
[12:26:17] [PASSED] VIC 101
[12:26:17] [PASSED] VIC 102
[12:26:17] [PASSED] VIC 106
[12:26:17] [PASSED] VIC 107
[12:26:17] === [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_valid ===
[12:26:17] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_10_bpc
[12:26:17] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv420_12_bpc
[12:26:17] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_8_bpc
[12:26:17] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_10_bpc
[12:26:17] [PASSED] drm_test_connector_hdmi_compute_mode_clock_yuv422_12_bpc
[12:26:17] === [PASSED] drm_test_connector_hdmi_compute_mode_clock ====
[12:26:17] == drm_hdmi_connector_get_broadcast_rgb_name (2 subtests) ==
[12:26:17] === drm_test_drm_hdmi_connector_get_broadcast_rgb_name ====
[12:26:17] [PASSED] Automatic
[12:26:17] [PASSED] Full
[12:26:17] [PASSED] Limited 16:235
[12:26:17] === [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name ===
[12:26:17] [PASSED] drm_test_drm_hdmi_connector_get_broadcast_rgb_name_invalid
[12:26:17] ==== [PASSED] drm_hdmi_connector_get_broadcast_rgb_name ====
[12:26:17] == drm_hdmi_connector_get_output_format_name (2 subtests) ==
[12:26:17] === drm_test_drm_hdmi_connector_get_output_format_name ====
[12:26:17] [PASSED] RGB
[12:26:17] [PASSED] YUV 4:2:0
[12:26:17] [PASSED] YUV 4:2:2
[12:26:17] [PASSED] YUV 4:4:4
[12:26:17] === [PASSED] drm_test_drm_hdmi_connector_get_output_format_name ===
[12:26:17] [PASSED] drm_test_drm_hdmi_connector_get_output_format_name_invalid
[12:26:17] ==== [PASSED] drm_hdmi_connector_get_output_format_name ====
[12:26:17] ============= drm_damage_helper (21 subtests) ==============
[12:26:17] [PASSED] drm_test_damage_iter_no_damage
[12:26:17] [PASSED] drm_test_damage_iter_no_damage_fractional_src
[12:26:17] [PASSED] drm_test_damage_iter_no_damage_src_moved
[12:26:17] [PASSED] drm_test_damage_iter_no_damage_fractional_src_moved
[12:26:17] [PASSED] drm_test_damage_iter_no_damage_not_visible
[12:26:17] [PASSED] drm_test_damage_iter_no_damage_no_crtc
[12:26:17] [PASSED] drm_test_damage_iter_no_damage_no_fb
[12:26:17] [PASSED] drm_test_damage_iter_simple_damage
[12:26:17] [PASSED] drm_test_damage_iter_single_damage
[12:26:17] [PASSED] drm_test_damage_iter_single_damage_intersect_src
[12:26:17] [PASSED] drm_test_damage_iter_single_damage_outside_src
[12:26:17] [PASSED] drm_test_damage_iter_single_damage_fractional_src
[12:26:17] [PASSED] drm_test_damage_iter_single_damage_intersect_fractional_src
[12:26:17] [PASSED] drm_test_damage_iter_single_damage_outside_fractional_src
[12:26:17] [PASSED] drm_test_damage_iter_single_damage_src_moved
[12:26:17] [PASSED] drm_test_damage_iter_single_damage_fractional_src_moved
[12:26:17] [PASSED] drm_test_damage_iter_damage
[12:26:17] [PASSED] drm_test_damage_iter_damage_one_intersect
[12:26:17] [PASSED] drm_test_damage_iter_damage_one_outside
[12:26:17] [PASSED] drm_test_damage_iter_damage_src_moved
[12:26:17] [PASSED] drm_test_damage_iter_damage_not_visible
[12:26:17] ================ [PASSED] drm_damage_helper ================
[12:26:17] ============== drm_dp_mst_helper (3 subtests) ==============
[12:26:17] ============== drm_test_dp_mst_calc_pbn_mode ==============
[12:26:17] [PASSED] Clock 154000 BPP 30 DSC disabled
[12:26:17] [PASSED] Clock 234000 BPP 30 DSC disabled
[12:26:17] [PASSED] Clock 297000 BPP 24 DSC disabled
[12:26:17] [PASSED] Clock 332880 BPP 24 DSC enabled
[12:26:17] [PASSED] Clock 324540 BPP 24 DSC enabled
[12:26:17] ========== [PASSED] drm_test_dp_mst_calc_pbn_mode ==========
[12:26:17] ============== drm_test_dp_mst_calc_pbn_div ===============
[12:26:17] [PASSED] Link rate 2000000 lane count 4
[12:26:17] [PASSED] Link rate 2000000 lane count 2
[12:26:17] [PASSED] Link rate 2000000 lane count 1
[12:26:17] [PASSED] Link rate 1350000 lane count 4
[12:26:17] [PASSED] Link rate 1350000 lane count 2
[12:26:17] [PASSED] Link rate 1350000 lane count 1
[12:26:17] [PASSED] Link rate 1000000 lane count 4
[12:26:17] [PASSED] Link rate 1000000 lane count 2
[12:26:17] [PASSED] Link rate 1000000 lane count 1
[12:26:17] [PASSED] Link rate 810000 lane count 4
[12:26:17] [PASSED] Link rate 810000 lane count 2
[12:26:17] [PASSED] Link rate 810000 lane count 1
[12:26:17] [PASSED] Link rate 540000 lane count 4
[12:26:17] [PASSED] Link rate 540000 lane count 2
[12:26:17] [PASSED] Link rate 540000 lane count 1
[12:26:17] [PASSED] Link rate 270000 lane count 4
[12:26:17] [PASSED] Link rate 270000 lane count 2
[12:26:17] [PASSED] Link rate 270000 lane count 1
[12:26:17] [PASSED] Link rate 162000 lane count 4
[12:26:17] [PASSED] Link rate 162000 lane count 2
[12:26:17] [PASSED] Link rate 162000 lane count 1
[12:26:17] ========== [PASSED] drm_test_dp_mst_calc_pbn_div ===========
[12:26:17] ========= drm_test_dp_mst_sideband_msg_req_decode =========
[12:26:17] [PASSED] DP_ENUM_PATH_RESOURCES with port number
[12:26:17] [PASSED] DP_POWER_UP_PHY with port number
[12:26:17] [PASSED] DP_POWER_DOWN_PHY with port number
[12:26:17] [PASSED] DP_ALLOCATE_PAYLOAD with SDP stream sinks
[12:26:17] [PASSED] DP_ALLOCATE_PAYLOAD with port number
[12:26:17] [PASSED] DP_ALLOCATE_PAYLOAD with VCPI
[12:26:17] [PASSED] DP_ALLOCATE_PAYLOAD with PBN
[12:26:17] [PASSED] DP_QUERY_PAYLOAD with port number
[12:26:17] [PASSED] DP_QUERY_PAYLOAD with VCPI
[12:26:17] [PASSED] DP_REMOTE_DPCD_READ with port number
[12:26:17] [PASSED] DP_REMOTE_DPCD_READ with DPCD address
[12:26:17] [PASSED] DP_REMOTE_DPCD_READ with max number of bytes
[12:26:17] [PASSED] DP_REMOTE_DPCD_WRITE with port number
[12:26:17] [PASSED] DP_REMOTE_DPCD_WRITE with DPCD address
[12:26:17] [PASSED] DP_REMOTE_DPCD_WRITE with data array
[12:26:17] [PASSED] DP_REMOTE_I2C_READ with port number
[12:26:17] [PASSED] DP_REMOTE_I2C_READ with I2C device ID
[12:26:17] [PASSED] DP_REMOTE_I2C_READ with transactions array
[12:26:17] [PASSED] DP_REMOTE_I2C_WRITE with port number
[12:26:17] [PASSED] DP_REMOTE_I2C_WRITE with I2C device ID
[12:26:17] [PASSED] DP_REMOTE_I2C_WRITE with data array
[12:26:17] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream ID
[12:26:17] [PASSED] DP_QUERY_STREAM_ENC_STATUS with client ID
[12:26:17] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream event
[12:26:17] [PASSED] DP_QUERY_STREAM_ENC_STATUS with valid stream event
[12:26:17] [PASSED] DP_QUERY_STREAM_ENC_STATUS with stream behavior
[12:26:17] [PASSED] DP_QUERY_STREAM_ENC_STATUS with a valid stream behavior
[12:26:17] ===== [PASSED] drm_test_dp_mst_sideband_msg_req_decode =====
[12:26:17] ================ [PASSED] drm_dp_mst_helper ================
[12:26:17] ================== drm_exec (7 subtests) ===================
[12:26:17] [PASSED] sanitycheck
[12:26:17] [PASSED] test_lock
[12:26:17] [PASSED] test_lock_unlock
[12:26:17] [PASSED] test_duplicates
[12:26:17] [PASSED] test_prepare
[12:26:17] [PASSED] test_prepare_array
[12:26:17] [PASSED] test_multiple_loops
[12:26:17] ==================== [PASSED] drm_exec =====================
[12:26:17] =========== drm_format_helper_test (17 subtests) ===========
[12:26:17] ============== drm_test_fb_xrgb8888_to_gray8 ==============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ========== [PASSED] drm_test_fb_xrgb8888_to_gray8 ==========
[12:26:17] ============= drm_test_fb_xrgb8888_to_rgb332 ==============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb332 ==========
[12:26:17] ============= drm_test_fb_xrgb8888_to_rgb565 ==============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb565 ==========
[12:26:17] ============ drm_test_fb_xrgb8888_to_xrgb1555 =============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ======== [PASSED] drm_test_fb_xrgb8888_to_xrgb1555 =========
[12:26:17] ============ drm_test_fb_xrgb8888_to_argb1555 =============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ======== [PASSED] drm_test_fb_xrgb8888_to_argb1555 =========
[12:26:17] ============ drm_test_fb_xrgb8888_to_rgba5551 =============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ======== [PASSED] drm_test_fb_xrgb8888_to_rgba5551 =========
[12:26:17] ============= drm_test_fb_xrgb8888_to_rgb888 ==============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ========= [PASSED] drm_test_fb_xrgb8888_to_rgb888 ==========
[12:26:17] ============= drm_test_fb_xrgb8888_to_bgr888 ==============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ========= [PASSED] drm_test_fb_xrgb8888_to_bgr888 ==========
[12:26:17] ============ drm_test_fb_xrgb8888_to_argb8888 =============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ======== [PASSED] drm_test_fb_xrgb8888_to_argb8888 =========
[12:26:17] =========== drm_test_fb_xrgb8888_to_xrgb2101010 ===========
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ======= [PASSED] drm_test_fb_xrgb8888_to_xrgb2101010 =======
[12:26:17] =========== drm_test_fb_xrgb8888_to_argb2101010 ===========
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ======= [PASSED] drm_test_fb_xrgb8888_to_argb2101010 =======
[12:26:17] ============== drm_test_fb_xrgb8888_to_mono ===============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ========== [PASSED] drm_test_fb_xrgb8888_to_mono ===========
[12:26:17] ==================== drm_test_fb_swab =====================
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ================ [PASSED] drm_test_fb_swab =================
[12:26:17] ============ drm_test_fb_xrgb8888_to_xbgr8888 =============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ======== [PASSED] drm_test_fb_xrgb8888_to_xbgr8888 =========
[12:26:17] ============ drm_test_fb_xrgb8888_to_abgr8888 =============
[12:26:17] [PASSED] single_pixel_source_buffer
[12:26:17] [PASSED] single_pixel_clip_rectangle
[12:26:17] [PASSED] well_known_colors
[12:26:17] [PASSED] destination_pitch
[12:26:17] ======== [PASSED] drm_test_fb_xrgb8888_to_abgr8888 =========
[12:26:17] ================= drm_test_fb_clip_offset =================
[12:26:17] [PASSED] pass through
[12:26:17] [PASSED] horizontal offset
[12:26:17] [PASSED] vertical offset
[12:26:17] [PASSED] horizontal and vertical offset
[12:26:17] [PASSED] horizontal offset (custom pitch)
[12:26:17] [PASSED] vertical offset (custom pitch)
[12:26:17] [PASSED] horizontal and vertical offset (custom pitch)
[12:26:17] ============= [PASSED] drm_test_fb_clip_offset =============
[12:26:17] =================== drm_test_fb_memcpy ====================
[12:26:17] [PASSED] single_pixel_source_buffer: XR24 little-endian (0x34325258)
[12:26:17] [PASSED] single_pixel_source_buffer: XRA8 little-endian (0x38415258)
[12:26:17] [PASSED] single_pixel_source_buffer: YU24 little-endian (0x34325559)
[12:26:17] [PASSED] single_pixel_clip_rectangle: XB24 little-endian (0x34324258)
[12:26:17] [PASSED] single_pixel_clip_rectangle: XRA8 little-endian (0x38415258)
[12:26:17] [PASSED] single_pixel_clip_rectangle: YU24 little-endian (0x34325559)
[12:26:17] [PASSED] well_known_colors: XB24 little-endian (0x34324258)
[12:26:17] [PASSED] well_known_colors: XRA8 little-endian (0x38415258)
[12:26:17] [PASSED] well_known_colors: YU24 little-endian (0x34325559)
[12:26:17] [PASSED] destination_pitch: XB24 little-endian (0x34324258)
[12:26:17] [PASSED] destination_pitch: XRA8 little-endian (0x38415258)
[12:26:17] [PASSED] destination_pitch: YU24 little-endian (0x34325559)
[12:26:17] =============== [PASSED] drm_test_fb_memcpy ================
[12:26:17] ============= [PASSED] drm_format_helper_test ==============
[12:26:17] ================= drm_format (18 subtests) =================
[12:26:17] [PASSED] drm_test_format_block_width_invalid
[12:26:17] [PASSED] drm_test_format_block_width_one_plane
[12:26:17] [PASSED] drm_test_format_block_width_two_plane
[12:26:17] [PASSED] drm_test_format_block_width_three_plane
[12:26:17] [PASSED] drm_test_format_block_width_tiled
[12:26:17] [PASSED] drm_test_format_block_height_invalid
[12:26:17] [PASSED] drm_test_format_block_height_one_plane
[12:26:17] [PASSED] drm_test_format_block_height_two_plane
[12:26:17] [PASSED] drm_test_format_block_height_three_plane
[12:26:17] [PASSED] drm_test_format_block_height_tiled
[12:26:17] [PASSED] drm_test_format_min_pitch_invalid
[12:26:17] [PASSED] drm_test_format_min_pitch_one_plane_8bpp
[12:26:17] [PASSED] drm_test_format_min_pitch_one_plane_16bpp
[12:26:17] [PASSED] drm_test_format_min_pitch_one_plane_24bpp
[12:26:17] [PASSED] drm_test_format_min_pitch_one_plane_32bpp
[12:26:17] [PASSED] drm_test_format_min_pitch_two_plane
[12:26:17] [PASSED] drm_test_format_min_pitch_three_plane_8bpp
[12:26:17] [PASSED] drm_test_format_min_pitch_tiled
[12:26:17] =================== [PASSED] drm_format ====================
[12:26:17] ============== drm_framebuffer (10 subtests) ===============
[12:26:17] ========== drm_test_framebuffer_check_src_coords ==========
[12:26:17] [PASSED] Success: source fits into fb
[12:26:17] [PASSED] Fail: overflowing fb with x-axis coordinate
[12:26:17] [PASSED] Fail: overflowing fb with y-axis coordinate
[12:26:17] [PASSED] Fail: overflowing fb with source width
[12:26:17] [PASSED] Fail: overflowing fb with source height
[12:26:17] ====== [PASSED] drm_test_framebuffer_check_src_coords ======
[12:26:17] [PASSED] drm_test_framebuffer_cleanup
[12:26:17] =============== drm_test_framebuffer_create ===============
[12:26:17] [PASSED] ABGR8888 normal sizes
[12:26:17] [PASSED] ABGR8888 max sizes
[12:26:17] [PASSED] ABGR8888 pitch greater than min required
[12:26:17] [PASSED] ABGR8888 pitch less than min required
[12:26:17] [PASSED] ABGR8888 Invalid width
[12:26:17] [PASSED] ABGR8888 Invalid buffer handle
[12:26:17] [PASSED] No pixel format
[12:26:17] [PASSED] ABGR8888 Width 0
[12:26:17] [PASSED] ABGR8888 Height 0
[12:26:17] [PASSED] ABGR8888 Out of bound height * pitch combination
[12:26:17] [PASSED] ABGR8888 Large buffer offset
[12:26:17] [PASSED] ABGR8888 Buffer offset for inexistent plane
[12:26:17] [PASSED] ABGR8888 Invalid flag
[12:26:17] [PASSED] ABGR8888 Set DRM_MODE_FB_MODIFIERS without modifiers
[12:26:17] [PASSED] ABGR8888 Valid buffer modifier
[12:26:17] [PASSED] ABGR8888 Invalid buffer modifier(DRM_FORMAT_MOD_SAMSUNG_64_32_TILE)
[12:26:17] [PASSED] ABGR8888 Extra pitches without DRM_MODE_FB_MODIFIERS
[12:26:17] [PASSED] ABGR8888 Extra pitches with DRM_MODE_FB_MODIFIERS
[12:26:17] [PASSED] NV12 Normal sizes
[12:26:17] [PASSED] NV12 Max sizes
[12:26:17] [PASSED] NV12 Invalid pitch
[12:26:17] [PASSED] NV12 Invalid modifier/missing DRM_MODE_FB_MODIFIERS flag
[12:26:17] [PASSED] NV12 different modifier per-plane
[12:26:17] [PASSED] NV12 with DRM_FORMAT_MOD_SAMSUNG_64_32_TILE
[12:26:17] [PASSED] NV12 Valid modifiers without DRM_MODE_FB_MODIFIERS
[12:26:17] [PASSED] NV12 Modifier for inexistent plane
[12:26:17] [PASSED] NV12 Handle for inexistent plane
[12:26:17] [PASSED] NV12 Handle for inexistent plane without DRM_MODE_FB_MODIFIERS
[12:26:17] [PASSED] YVU420 DRM_MODE_FB_MODIFIERS set without modifier
[12:26:17] [PASSED] YVU420 Normal sizes
[12:26:17] [PASSED] YVU420 Max sizes
[12:26:17] [PASSED] YVU420 Invalid pitch
[12:26:17] [PASSED] YVU420 Different pitches
[12:26:17] [PASSED] YVU420 Different buffer offsets/pitches
[12:26:17] [PASSED] YVU420 Modifier set just for plane 0, without DRM_MODE_FB_MODIFIERS
[12:26:17] [PASSED] YVU420 Modifier set just for planes 0, 1, without DRM_MODE_FB_MODIFIERS
[12:26:17] [PASSED] YVU420 Modifier set just for plane 0, 1, with DRM_MODE_FB_MODIFIERS
[12:26:17] [PASSED] YVU420 Valid modifier
[12:26:17] [PASSED] YVU420 Different modifiers per plane
[12:26:17] [PASSED] YVU420 Modifier for inexistent plane
[12:26:17] [PASSED] YUV420_10BIT Invalid modifier(DRM_FORMAT_MOD_LINEAR)
[12:26:17] [PASSED] X0L2 Normal sizes
[12:26:17] [PASSED] X0L2 Max sizes
[12:26:17] [PASSED] X0L2 Invalid pitch
[12:26:17] [PASSED] X0L2 Pitch greater than minimum required
[12:26:17] [PASSED] X0L2 Handle for inexistent plane
[12:26:17] [PASSED] X0L2 Offset for inexistent plane, without DRM_MODE_FB_MODIFIERS set
[12:26:17] [PASSED] X0L2 Modifier without DRM_MODE_FB_MODIFIERS set
[12:26:17] [PASSED] X0L2 Valid modifier
[12:26:17] [PASSED] X0L2 Modifier for inexistent plane
[12:26:17] =========== [PASSED] drm_test_framebuffer_create ===========
[12:26:17] [PASSED] drm_test_framebuffer_free
[12:26:17] [PASSED] drm_test_framebuffer_init
[12:26:17] [PASSED] drm_test_framebuffer_init_bad_format
[12:26:17] [PASSED] drm_test_framebuffer_init_dev_mismatch
[12:26:17] [PASSED] drm_test_framebuffer_lookup
[12:26:17] [PASSED] drm_test_framebuffer_lookup_inexistent
[12:26:17] [PASSED] drm_test_framebuffer_modifiers_not_supported
[12:26:17] ================= [PASSED] drm_framebuffer =================
[12:26:17] ================ drm_gem_shmem (8 subtests) ================
[12:26:17] [PASSED] drm_gem_shmem_test_obj_create
[12:26:17] [PASSED] drm_gem_shmem_test_obj_create_private
[12:26:17] [PASSED] drm_gem_shmem_test_pin_pages
[12:26:17] [PASSED] drm_gem_shmem_test_vmap
[12:26:17] [PASSED] drm_gem_shmem_test_get_sg_table
[12:26:17] [PASSED] drm_gem_shmem_test_get_pages_sgt
[12:26:17] [PASSED] drm_gem_shmem_test_madvise
[12:26:17] [PASSED] drm_gem_shmem_test_purge
[12:26:17] ================== [PASSED] drm_gem_shmem ==================
[12:26:17] === drm_atomic_helper_connector_hdmi_check (27 subtests) ===
[12:26:17] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode
[12:26:17] [PASSED] drm_test_check_broadcast_rgb_auto_cea_mode_vic_1
[12:26:17] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode
[12:26:17] [PASSED] drm_test_check_broadcast_rgb_full_cea_mode_vic_1
[12:26:17] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode
[12:26:17] [PASSED] drm_test_check_broadcast_rgb_limited_cea_mode_vic_1
[12:26:17] ====== drm_test_check_broadcast_rgb_cea_mode_yuv420 =======
[12:26:17] [PASSED] Automatic
[12:26:17] [PASSED] Full
[12:26:17] [PASSED] Limited 16:235
[12:26:17] == [PASSED] drm_test_check_broadcast_rgb_cea_mode_yuv420 ===
[12:26:17] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_changed
[12:26:17] [PASSED] drm_test_check_broadcast_rgb_crtc_mode_not_changed
[12:26:17] [PASSED] drm_test_check_disable_connector
[12:26:17] [PASSED] drm_test_check_hdmi_funcs_reject_rate
[12:26:17] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_rgb
[12:26:17] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_yuv420
[12:26:17] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv422
[12:26:17] [PASSED] drm_test_check_max_tmds_rate_bpc_fallback_ignore_yuv420
[12:26:17] [PASSED] drm_test_check_driver_unsupported_fallback_yuv420
[12:26:17] [PASSED] drm_test_check_output_bpc_crtc_mode_changed
[12:26:17] [PASSED] drm_test_check_output_bpc_crtc_mode_not_changed
[12:26:17] [PASSED] drm_test_check_output_bpc_dvi
[12:26:17] [PASSED] drm_test_check_output_bpc_format_vic_1
[12:26:17] [PASSED] drm_test_check_output_bpc_format_display_8bpc_only
[12:26:17] [PASSED] drm_test_check_output_bpc_format_display_rgb_only
[12:26:17] [PASSED] drm_test_check_output_bpc_format_driver_8bpc_only
[12:26:17] [PASSED] drm_test_check_output_bpc_format_driver_rgb_only
[12:26:17] [PASSED] drm_test_check_tmds_char_rate_rgb_8bpc
[12:26:17] [PASSED] drm_test_check_tmds_char_rate_rgb_10bpc
[12:26:17] [PASSED] drm_test_check_tmds_char_rate_rgb_12bpc
[12:26:17] ===== [PASSED] drm_atomic_helper_connector_hdmi_check ======
[12:26:17] === drm_atomic_helper_connector_hdmi_reset (6 subtests) ====
[12:26:17] [PASSED] drm_test_check_broadcast_rgb_value
[12:26:17] [PASSED] drm_test_check_bpc_8_value
[12:26:17] [PASSED] drm_test_check_bpc_10_value
[12:26:17] [PASSED] drm_test_check_bpc_12_value
[12:26:17] [PASSED] drm_test_check_format_value
[12:26:17] [PASSED] drm_test_check_tmds_char_value
[12:26:17] ===== [PASSED] drm_atomic_helper_connector_hdmi_reset ======
[12:26:17] = drm_atomic_helper_connector_hdmi_mode_valid (4 subtests) =
[12:26:17] [PASSED] drm_test_check_mode_valid
[12:26:17] [PASSED] drm_test_check_mode_valid_reject
[12:26:17] [PASSED] drm_test_check_mode_valid_reject_rate
[12:26:17] [PASSED] drm_test_check_mode_valid_reject_max_clock
[12:26:17] === [PASSED] drm_atomic_helper_connector_hdmi_mode_valid ===
[12:26:17] = drm_atomic_helper_connector_hdmi_infoframes (5 subtests) =
[12:26:17] [PASSED] drm_test_check_infoframes
[12:26:17] [PASSED] drm_test_check_reject_avi_infoframe
[12:26:17] [PASSED] drm_test_check_reject_hdr_infoframe_bpc_8
[12:26:17] [PASSED] drm_test_check_reject_hdr_infoframe_bpc_10
[12:26:17] [PASSED] drm_test_check_reject_audio_infoframe
[12:26:17] === [PASSED] drm_atomic_helper_connector_hdmi_infoframes ===
[12:26:17] ================= drm_managed (2 subtests) =================
[12:26:17] [PASSED] drm_test_managed_release_action
[12:26:17] [PASSED] drm_test_managed_run_action
[12:26:17] =================== [PASSED] drm_managed ===================
[12:26:17] =================== drm_mm (6 subtests) ====================
[12:26:17] [PASSED] drm_test_mm_init
[12:26:17] [PASSED] drm_test_mm_debug
[12:26:17] [PASSED] drm_test_mm_align32
[12:26:17] [PASSED] drm_test_mm_align64
[12:26:17] [PASSED] drm_test_mm_lowest
[12:26:17] [PASSED] drm_test_mm_highest
[12:26:17] ===================== [PASSED] drm_mm ======================
[12:26:17] ============= drm_modes_analog_tv (5 subtests) =============
[12:26:17] [PASSED] drm_test_modes_analog_tv_mono_576i
[12:26:17] [PASSED] drm_test_modes_analog_tv_ntsc_480i
[12:26:17] [PASSED] drm_test_modes_analog_tv_ntsc_480i_inlined
[12:26:17] [PASSED] drm_test_modes_analog_tv_pal_576i
[12:26:17] [PASSED] drm_test_modes_analog_tv_pal_576i_inlined
[12:26:17] =============== [PASSED] drm_modes_analog_tv ===============
[12:26:17] ============== drm_plane_helper (2 subtests) ===============
[12:26:17] =============== drm_test_check_plane_state ================
[12:26:17] [PASSED] clipping_simple
[12:26:17] [PASSED] clipping_rotate_reflect
[12:26:17] [PASSED] positioning_simple
[12:26:17] [PASSED] upscaling
[12:26:17] [PASSED] downscaling
[12:26:17] [PASSED] rounding1
[12:26:17] [PASSED] rounding2
[12:26:17] [PASSED] rounding3
[12:26:17] [PASSED] rounding4
[12:26:17] =========== [PASSED] drm_test_check_plane_state ============
[12:26:17] =========== drm_test_check_invalid_plane_state ============
[12:26:17] [PASSED] positioning_invalid
[12:26:17] [PASSED] upscaling_invalid
[12:26:17] [PASSED] downscaling_invalid
[12:26:17] ======= [PASSED] drm_test_check_invalid_plane_state ========
[12:26:17] ================ [PASSED] drm_plane_helper =================
[12:26:17] ====== drm_connector_helper_tv_get_modes (1 subtest) =======
[12:26:17] ====== drm_test_connector_helper_tv_get_modes_check =======
[12:26:17] [PASSED] None
[12:26:17] [PASSED] PAL
[12:26:17] [PASSED] NTSC
[12:26:17] [PASSED] Both, NTSC Default
[12:26:17] [PASSED] Both, PAL Default
[12:26:17] [PASSED] Both, NTSC Default, with PAL on command-line
[12:26:17] [PASSED] Both, PAL Default, with NTSC on command-line
[12:26:17] == [PASSED] drm_test_connector_helper_tv_get_modes_check ===
[12:26:17] ======== [PASSED] drm_connector_helper_tv_get_modes ========
[12:26:17] ================== drm_rect (9 subtests) ===================
[12:26:17] [PASSED] drm_test_rect_clip_scaled_div_by_zero
[12:26:17] [PASSED] drm_test_rect_clip_scaled_not_clipped
[12:26:17] [PASSED] drm_test_rect_clip_scaled_clipped
[12:26:17] [PASSED] drm_test_rect_clip_scaled_signed_vs_unsigned
[12:26:17] ================= drm_test_rect_intersect =================
[12:26:17] [PASSED] top-left x bottom-right: 2x2+1+1 x 2x2+0+0
[12:26:17] [PASSED] top-right x bottom-left: 2x2+0+0 x 2x2+1-1
[12:26:17] [PASSED] bottom-left x top-right: 2x2+1-1 x 2x2+0+0
[12:26:17] [PASSED] bottom-right x top-left: 2x2+0+0 x 2x2+1+1
[12:26:17] [PASSED] right x left: 2x1+0+0 x 3x1+1+0
[12:26:17] [PASSED] left x right: 3x1+1+0 x 2x1+0+0
[12:26:17] [PASSED] up x bottom: 1x2+0+0 x 1x3+0-1
[12:26:17] [PASSED] bottom x up: 1x3+0-1 x 1x2+0+0
[12:26:17] [PASSED] touching corner: 1x1+0+0 x 2x2+1+1
[12:26:17] [PASSED] touching side: 1x1+0+0 x 1x1+1+0
[12:26:17] [PASSED] equal rects: 2x2+0+0 x 2x2+0+0
[12:26:17] [PASSED] inside another: 2x2+0+0 x 1x1+1+1
[12:26:17] [PASSED] far away: 1x1+0+0 x 1x1+3+6
[12:26:17] [PASSED] points intersecting: 0x0+5+10 x 0x0+5+10
[12:26:17] [PASSED] points not intersecting: 0x0+0+0 x 0x0+5+10
[12:26:17] ============= [PASSED] drm_test_rect_intersect =============
[12:26:17] ================ drm_test_rect_calc_hscale ================
[12:26:17] [PASSED] normal use
[12:26:17] [PASSED] out of max range
[12:26:17] [PASSED] out of min range
[12:26:17] [PASSED] zero dst
[12:26:17] [PASSED] negative src
[12:26:17] [PASSED] negative dst
[12:26:17] ============ [PASSED] drm_test_rect_calc_hscale ============
[12:26:17] ================ drm_test_rect_calc_vscale ================
[12:26:17] [PASSED] normal use
[12:26:17] [PASSED] out of max range
[12:26:17] [PASSED] out of min range
[12:26:17] [PASSED] zero dst
[12:26:17] [PASSED] negative src
[12:26:17] [PASSED] negative dst
[12:26:17] ============ [PASSED] drm_test_rect_calc_vscale ============
[12:26:17] ================== drm_test_rect_rotate ===================
[12:26:17] [PASSED] reflect-x
[12:26:17] [PASSED] reflect-y
[12:26:17] [PASSED] rotate-0
[12:26:17] [PASSED] rotate-90
[12:26:17] [PASSED] rotate-180
[12:26:17] [PASSED] rotate-270
[12:26:17] ============== [PASSED] drm_test_rect_rotate ===============
[12:26:17] ================ drm_test_rect_rotate_inv =================
[12:26:17] [PASSED] reflect-x
[12:26:17] [PASSED] reflect-y
[12:26:17] [PASSED] rotate-0
[12:26:17] [PASSED] rotate-90
[12:26:17] [PASSED] rotate-180
[12:26:17] [PASSED] rotate-270
[12:26:17] ============ [PASSED] drm_test_rect_rotate_inv =============
[12:26:17] ==================== [PASSED] drm_rect =====================
[12:26:17] ============ drm_sysfb_modeset_test (1 subtest) ============
[12:26:17] ============ drm_test_sysfb_build_fourcc_list =============
[12:26:17] [PASSED] no native formats
[12:26:17] [PASSED] XRGB8888 as native format
[12:26:17] [PASSED] remove duplicates
[12:26:17] [PASSED] convert alpha formats
[12:26:17] [PASSED] random formats
[12:26:17] ======== [PASSED] drm_test_sysfb_build_fourcc_list =========
[12:26:17] ============= [PASSED] drm_sysfb_modeset_test ==============
[12:26:17] ================== drm_fixp (2 subtests) ===================
[12:26:17] [PASSED] drm_test_int2fixp
[12:26:17] [PASSED] drm_test_sm2fixp
[12:26:17] ==================== [PASSED] drm_fixp =====================
[12:26:17] ============================================================
[12:26:17] Testing complete. Ran 621 tests: passed: 621
[12:26:17] Elapsed time: 37.014s total, 1.720s configuring, 35.120s building, 0.165s running
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/ttm/tests/.kunitconfig
[12:26:17] Configuring KUnit Kernel ...
Regenerating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[12:26:19] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
[12:26:36] Starting KUnit Kernel (1/1)...
[12:26:36] ============================================================
Running tests with:
$ .kunit/linux kunit.enable=1 mem=1G console=tty kunit_shutdown=halt
[12:26:36] ================= ttm_device (5 subtests) ==================
[12:26:36] [PASSED] ttm_device_init_basic
[12:26:36] [PASSED] ttm_device_init_multiple
[12:26:36] [PASSED] ttm_device_fini_basic
[12:26:36] [PASSED] ttm_device_init_no_vma_man
[12:26:36] ================== ttm_device_init_pools ==================
[12:26:36] [PASSED] No DMA allocations, no DMA32 required
[12:26:36] [PASSED] DMA allocations, DMA32 required
[12:26:36] [PASSED] No DMA allocations, DMA32 required
[12:26:36] [PASSED] DMA allocations, no DMA32 required
[12:26:36] ============== [PASSED] ttm_device_init_pools ==============
[12:26:36] =================== [PASSED] ttm_device ====================
[12:26:36] ================== ttm_pool (8 subtests) ===================
[12:26:36] ================== ttm_pool_alloc_basic ===================
[12:26:36] [PASSED] One page
[12:26:36] [PASSED] More than one page
[12:26:36] [PASSED] Above the allocation limit
[12:26:36] [PASSED] One page, with coherent DMA mappings enabled
[12:26:36] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[12:26:36] ============== [PASSED] ttm_pool_alloc_basic ===============
[12:26:36] ============== ttm_pool_alloc_basic_dma_addr ==============
[12:26:36] [PASSED] One page
[12:26:36] [PASSED] More than one page
[12:26:36] [PASSED] Above the allocation limit
[12:26:36] [PASSED] One page, with coherent DMA mappings enabled
[12:26:36] [PASSED] Above the allocation limit, with coherent DMA mappings enabled
[12:26:36] ========== [PASSED] ttm_pool_alloc_basic_dma_addr ==========
[12:26:36] [PASSED] ttm_pool_alloc_order_caching_match
[12:26:36] [PASSED] ttm_pool_alloc_caching_mismatch
[12:26:36] [PASSED] ttm_pool_alloc_order_mismatch
[12:26:36] [PASSED] ttm_pool_free_dma_alloc
[12:26:36] [PASSED] ttm_pool_free_no_dma_alloc
[12:26:36] [PASSED] ttm_pool_fini_basic
[12:26:36] ==================== [PASSED] ttm_pool =====================
[12:26:36] ================ ttm_resource (8 subtests) =================
[12:26:36] ================= ttm_resource_init_basic =================
[12:26:36] [PASSED] Init resource in TTM_PL_SYSTEM
[12:26:36] [PASSED] Init resource in TTM_PL_VRAM
[12:26:36] [PASSED] Init resource in a private placement
[12:26:36] [PASSED] Init resource in TTM_PL_SYSTEM, set placement flags
[12:26:36] ============= [PASSED] ttm_resource_init_basic =============
[12:26:36] [PASSED] ttm_resource_init_pinned
[12:26:36] [PASSED] ttm_resource_fini_basic
[12:26:36] [PASSED] ttm_resource_manager_init_basic
[12:26:36] [PASSED] ttm_resource_manager_usage_basic
[12:26:36] [PASSED] ttm_resource_manager_set_used_basic
[12:26:36] [PASSED] ttm_sys_man_alloc_basic
[12:26:36] [PASSED] ttm_sys_man_free_basic
[12:26:36] ================== [PASSED] ttm_resource ===================
[12:26:36] =================== ttm_tt (15 subtests) ===================
[12:26:36] ==================== ttm_tt_init_basic ====================
[12:26:36] [PASSED] Page-aligned size
[12:26:36] [PASSED] Extra pages requested
[12:26:36] ================ [PASSED] ttm_tt_init_basic ================
[12:26:36] [PASSED] ttm_tt_init_misaligned
[12:26:36] [PASSED] ttm_tt_fini_basic
[12:26:36] [PASSED] ttm_tt_fini_sg
[12:26:36] [PASSED] ttm_tt_fini_shmem
[12:26:36] [PASSED] ttm_tt_create_basic
[12:26:36] [PASSED] ttm_tt_create_invalid_bo_type
[12:26:36] [PASSED] ttm_tt_create_ttm_exists
[12:26:36] [PASSED] ttm_tt_create_failed
[12:26:36] [PASSED] ttm_tt_destroy_basic
[12:26:36] [PASSED] ttm_tt_populate_null_ttm
[12:26:36] [PASSED] ttm_tt_populate_populated_ttm
[12:26:36] [PASSED] ttm_tt_unpopulate_basic
[12:26:36] [PASSED] ttm_tt_unpopulate_empty_ttm
[12:26:36] [PASSED] ttm_tt_swapin_basic
[12:26:36] ===================== [PASSED] ttm_tt ======================
[12:26:36] =================== ttm_bo (14 subtests) ===================
[12:26:36] =========== ttm_bo_reserve_optimistic_no_ticket ===========
[12:26:36] [PASSED] Cannot be interrupted and sleeps
[12:26:36] [PASSED] Cannot be interrupted, locks straight away
[12:26:36] [PASSED] Can be interrupted, sleeps
[12:26:36] ======= [PASSED] ttm_bo_reserve_optimistic_no_ticket =======
[12:26:36] [PASSED] ttm_bo_reserve_locked_no_sleep
[12:26:36] [PASSED] ttm_bo_reserve_no_wait_ticket
[12:26:36] [PASSED] ttm_bo_reserve_double_resv
[12:26:36] [PASSED] ttm_bo_reserve_interrupted
[12:26:36] [PASSED] ttm_bo_reserve_deadlock
[12:26:36] [PASSED] ttm_bo_unreserve_basic
[12:26:36] [PASSED] ttm_bo_unreserve_pinned
[12:26:36] [PASSED] ttm_bo_unreserve_bulk
[12:26:36] [PASSED] ttm_bo_fini_basic
[12:26:36] [PASSED] ttm_bo_fini_shared_resv
[12:26:36] [PASSED] ttm_bo_pin_basic
[12:26:36] [PASSED] ttm_bo_pin_unpin_resource
[12:26:36] [PASSED] ttm_bo_multiple_pin_one_unpin
[12:26:36] ===================== [PASSED] ttm_bo ======================
[12:26:36] ============== ttm_bo_validate (22 subtests) ===============
[12:26:36] ============== ttm_bo_init_reserved_sys_man ===============
[12:26:36] [PASSED] Buffer object for userspace
[12:26:36] [PASSED] Kernel buffer object
[12:26:36] [PASSED] Shared buffer object
[12:26:36] ========== [PASSED] ttm_bo_init_reserved_sys_man ===========
[12:26:36] ============== ttm_bo_init_reserved_mock_man ==============
[12:26:36] [PASSED] Buffer object for userspace
[12:26:36] [PASSED] Kernel buffer object
[12:26:36] [PASSED] Shared buffer object
[12:26:36] ========== [PASSED] ttm_bo_init_reserved_mock_man ==========
[12:26:36] [PASSED] ttm_bo_init_reserved_resv
[12:26:36] ================== ttm_bo_validate_basic ==================
[12:26:36] [PASSED] Buffer object for userspace
[12:26:36] [PASSED] Kernel buffer object
[12:26:36] [PASSED] Shared buffer object
[12:26:36] ============== [PASSED] ttm_bo_validate_basic ==============
[12:26:36] [PASSED] ttm_bo_validate_invalid_placement
[12:26:36] ============= ttm_bo_validate_same_placement ==============
[12:26:36] [PASSED] System manager
[12:26:36] [PASSED] VRAM manager
[12:26:36] ========= [PASSED] ttm_bo_validate_same_placement ==========
[12:26:36] [PASSED] ttm_bo_validate_failed_alloc
[12:26:36] [PASSED] ttm_bo_validate_pinned
[12:26:36] [PASSED] ttm_bo_validate_busy_placement
[12:26:36] ================ ttm_bo_validate_multihop =================
[12:26:36] [PASSED] Buffer object for userspace
[12:26:36] [PASSED] Kernel buffer object
[12:26:36] [PASSED] Shared buffer object
[12:26:36] ============ [PASSED] ttm_bo_validate_multihop =============
[12:26:36] ========== ttm_bo_validate_no_placement_signaled ==========
[12:26:36] [PASSED] Buffer object in system domain, no page vector
[12:26:36] [PASSED] Buffer object in system domain with an existing page vector
[12:26:36] ====== [PASSED] ttm_bo_validate_no_placement_signaled ======
[12:26:36] ======== ttm_bo_validate_no_placement_not_signaled ========
[12:26:36] [PASSED] Buffer object for userspace
[12:26:36] [PASSED] Kernel buffer object
[12:26:36] [PASSED] Shared buffer object
[12:26:36] ==== [PASSED] ttm_bo_validate_no_placement_not_signaled ====
[12:26:36] [PASSED] ttm_bo_validate_move_fence_signaled
[12:26:36] ========= ttm_bo_validate_move_fence_not_signaled =========
[12:26:36] [PASSED] Waits for GPU
[12:26:36] [PASSED] Tries to lock straight away
[12:26:36] ===== [PASSED] ttm_bo_validate_move_fence_not_signaled =====
[12:26:36] [PASSED] ttm_bo_validate_swapout
[12:26:36] [PASSED] ttm_bo_validate_happy_evict
[12:26:36] [PASSED] ttm_bo_validate_all_pinned_evict
[12:26:36] [PASSED] ttm_bo_validate_allowed_only_evict
[12:26:36] [PASSED] ttm_bo_validate_deleted_evict
[12:26:36] [PASSED] ttm_bo_validate_busy_domain_evict
[12:26:36] [PASSED] ttm_bo_validate_evict_gutting
[12:26:36] [PASSED] ttm_bo_validate_recrusive_evict
stty: 'standard input': Inappropriate ioctl for device
[12:26:36] ================= [PASSED] ttm_bo_validate =================
[12:26:36] ============================================================
[12:26:36] Testing complete. Ran 102 tests: passed: 102
[12:26:36] Elapsed time: 19.464s total, 2.672s configuring, 16.517s building, 0.260s running
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel
^ permalink raw reply [flat|nested] 24+ messages in thread
* ✓ Xe.CI.BAT: success for Add memory page offlining support (rev6)
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
` (8 preceding siblings ...)
2026-03-27 12:26 ` ✓ CI.KUnit: success " Patchwork
@ 2026-03-27 13:16 ` Patchwork
2026-03-28 4:49 ` ✓ Xe.CI.FULL: " Patchwork
10 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2026-03-27 13:16 UTC (permalink / raw)
To: Tejas Upadhyay; +Cc: intel-xe
[-- Attachment #1: Type: text/plain, Size: 1860 bytes --]
== Series Details ==
Series: Add memory page offlining support (rev6)
URL : https://patchwork.freedesktop.org/series/161473/
State : success
== Summary ==
CI Bug Log - changes from xe-4805-b75bacf9b36568f2ebf5319b955ba0c9cab5d6d1_BAT -> xe-pw-161473v6_BAT
====================================================
Summary
-------
**SUCCESS**
No regressions found.
Participating hosts (14 -> 14)
------------------------------
No changes in participating hosts
Known issues
------------
Here are the changes found in xe-pw-161473v6_BAT that come from known issues:
### IGT changes ###
#### Issues hit ####
* igt@xe_waitfence@abstime:
- bat-dg2-oem2: [PASS][1] -> [TIMEOUT][2] ([Intel XE#6506])
[1]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4805-b75bacf9b36568f2ebf5319b955ba0c9cab5d6d1/bat-dg2-oem2/igt@xe_waitfence@abstime.html
[2]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-161473v6/bat-dg2-oem2/igt@xe_waitfence@abstime.html
#### Possible fixes ####
* igt@xe_waitfence@reltime:
- bat-dg2-oem2: [FAIL][3] ([Intel XE#6520]) -> [PASS][4]
[3]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-4805-b75bacf9b36568f2ebf5319b955ba0c9cab5d6d1/bat-dg2-oem2/igt@xe_waitfence@reltime.html
[4]: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-161473v6/bat-dg2-oem2/igt@xe_waitfence@reltime.html
[Intel XE#6506]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6506
[Intel XE#6520]: https://gitlab.freedesktop.org/drm/xe/kernel/issues/6520
Build changes
-------------
* Linux: xe-4805-b75bacf9b36568f2ebf5319b955ba0c9cab5d6d1 -> xe-pw-161473v6
IGT_8834: 8834
xe-4805-b75bacf9b36568f2ebf5319b955ba0c9cab5d6d1: b75bacf9b36568f2ebf5319b955ba0c9cab5d6d1
xe-pw-161473v6: 161473v6
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-161473v6/index.html
[-- Attachment #2: Type: text/html, Size: 2447 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* ✓ Xe.CI.FULL: success for Add memory page offlining support (rev6)
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
` (9 preceding siblings ...)
2026-03-27 13:16 ` ✓ Xe.CI.BAT: " Patchwork
@ 2026-03-28 4:49 ` Patchwork
10 siblings, 0 replies; 24+ messages in thread
From: Patchwork @ 2026-03-28 4:49 UTC (permalink / raw)
To: Tejas Upadhyay; +Cc: intel-xe
[-- Attachment #1: Type: text/plain, Size: 863 bytes --]
== Series Details ==
Series: Add memory page offlining support (rev6)
URL : https://patchwork.freedesktop.org/series/161473/
State : success
== Summary ==
CI Bug Log - changes from xe-4805-b75bacf9b36568f2ebf5319b955ba0c9cab5d6d1_FULL -> xe-pw-161473v6_FULL
====================================================
Summary
-------
**SUCCESS**
No regressions found.
Participating hosts (2 -> 2)
------------------------------
No changes in participating hosts
Changes
-------
No changes found
Build changes
-------------
* Linux: xe-4805-b75bacf9b36568f2ebf5319b955ba0c9cab5d6d1 -> xe-pw-161473v6
IGT_8834: 8834
xe-4805-b75bacf9b36568f2ebf5319b955ba0c9cab5d6d1: b75bacf9b36568f2ebf5319b955ba0c9cab5d6d1
xe-pw-161473v6: 161473v6
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-161473v6/index.html
[-- Attachment #2: Type: text/html, Size: 1411 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH V6 3/7] drm/xe: Handle physical memory address error
2026-03-27 11:48 ` [RFC PATCH V6 3/7] drm/xe: Handle physical memory address error Tejas Upadhyay
@ 2026-04-01 23:53 ` Matthew Brost
2026-04-02 1:03 ` Matthew Brost
1 sibling, 0 replies; 24+ messages in thread
From: Matthew Brost @ 2026-04-01 23:53 UTC (permalink / raw)
To: Tejas Upadhyay
Cc: intel-xe, matthew.auld, thomas.hellstrom, himal.prasad.ghimiray
On Fri, Mar 27, 2026 at 05:18:16PM +0530, Tejas Upadhyay wrote:
> This functionality represents a significant step in making
> the xe driver gracefully handle hardware memory degradation.
> By integrating with the DRM Buddy allocator, the driver
> can permanently "carve out" faulty memory so it isn't reused
> by subsequent allocations.
>
> Buddy Block Reservation:
> ----------------------
> When a memory address is reported as faulty, the driver instructs
> the DRM Buddy allocator to reserve a block of the specific page
> size (typically 4KB). This marks the memory as "dirty/used"
> indefinitely.
>
> Two-Stage Tracking:
> -----------------
> Offlined Pages:
> Pages that have been successfully isolated and removed from the
> available memory pool.
>
> Queued Pages:
> Addresses that have been flagged as faulty but are currently in
> use by a process. These are tracked until the associated buffer
> object (BO) is released or migrated, at which point they move
> to the "offlined" state.
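The queued/offlined split described above can be modeled as a small state machine over two lists. Below is a minimal userspace sketch with hypothetical names (the driver itself uses `list_head` entries and per-manager counters under `mgr->lock`):

```c
#include <assert.h>
#include <stddef.h>

enum bad_page_state { BAD_PAGE_QUEUED, BAD_PAGE_OFFLINED };

struct bad_page {
	unsigned long addr;
	enum bad_page_state state;
	struct bad_page *next;
};

struct bad_page_tracker {
	struct bad_page *queued;
	struct bad_page *offlined;
	unsigned int n_queued;
	unsigned int n_offlined;
};

/* Flag a faulty address: it stays queued while the owning BO is live. */
static void bad_page_queue(struct bad_page_tracker *t, struct bad_page *p,
			   unsigned long addr)
{
	p->addr = addr;
	p->state = BAD_PAGE_QUEUED;
	p->next = t->queued;
	t->queued = p;
	t->n_queued++;
}

/* BO released or migrated: move the entry to the offlined list. */
static int bad_page_offline(struct bad_page_tracker *t, unsigned long addr)
{
	struct bad_page **pp;

	for (pp = &t->queued; *pp; pp = &(*pp)->next) {
		if ((*pp)->addr == addr) {
			struct bad_page *p = *pp;

			*pp = p->next;
			t->n_queued--;
			p->state = BAD_PAGE_OFFLINED;
			p->next = t->offlined;
			t->offlined = p;
			t->n_offlined++;
			return 0;
		}
	}
	return -1; /* address was never queued */
}
```

The real series additionally reserves the page in the buddy allocator at the queued-to-offlined transition; this sketch only captures the bookkeeping.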
>
> Sysfs Reporting:
> --------------
> The patch exposes these metrics through a standard interface,
> allowing administrators to monitor VRAM health:
> /sys/bus/pci/devices/<device_id>/vram_bad_pages
>
> V5:
> - Categorise and handle BOs accordingly
> - Fix crash found with new debugfs tests
> V4:
> - Set block->private NULL post bo purge
> - Filter out gsm address early on
> - Rebase
> V3:
> - rename API, remove tile dependency and add status of reservation
> V2:
> - Fix mm->avail counter issue
> - Remove unused code and handle clean up in case of error
>
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> ---
> drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 336 +++++++++++++++++++++
> drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 1 +
> drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 26 ++
> 3 files changed, 363 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> index c627dbf94552..0fec7b332501 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> @@ -13,7 +13,10 @@
>
> #include "xe_bo.h"
> #include "xe_device.h"
> +#include "xe_exec_queue.h"
> +#include "xe_lrc.h"
> #include "xe_res_cursor.h"
> +#include "xe_ttm_stolen_mgr.h"
> #include "xe_ttm_vram_mgr.h"
> #include "xe_vram_types.h"
>
> @@ -277,6 +280,26 @@ static const struct ttm_resource_manager_func xe_ttm_vram_mgr_func = {
> .debug = xe_ttm_vram_mgr_debug
> };
>
> +static void xe_ttm_vram_free_bad_pages(struct drm_device *dev, struct xe_ttm_vram_mgr *mgr)
> +{
> + struct xe_ttm_vram_offline_resource *pos, *n;
> +
> + mutex_lock(&mgr->lock);
> + list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
> + --mgr->n_offlined_pages;
> + gpu_buddy_free_list(&mgr->mm, &pos->blocks, 0);
> + mgr->visible_avail += pos->used_visible_size;
> + list_del(&pos->offlined_link);
> + kfree(pos);
> + }
> + list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
> + list_del(&pos->queued_link);
> + mgr->n_queued_pages--;
> + kfree(pos);
> + }
> + mutex_unlock(&mgr->lock);
> +}
> +
> static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
> {
> struct xe_device *xe = to_xe_device(dev);
> @@ -288,6 +311,8 @@ static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
> if (ttm_resource_manager_evict_all(&xe->ttm, man))
> return;
>
> + xe_ttm_vram_free_bad_pages(dev, mgr);
> +
> WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
>
> gpu_buddy_fini(&mgr->mm);
> @@ -316,6 +341,8 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
> man->func = &xe_ttm_vram_mgr_func;
> mgr->mem_type = mem_type;
> mutex_init(&mgr->lock);
> + INIT_LIST_HEAD(&mgr->offlined_pages);
> + INIT_LIST_HEAD(&mgr->queued_pages);
> mgr->default_page_size = default_page_size;
> mgr->visible_size = io_size;
> mgr->visible_avail = io_size;
> @@ -471,3 +498,312 @@ u64 xe_ttm_vram_get_avail(struct ttm_resource_manager *man)
>
> return avail;
> }
> +
> +static bool is_ttm_vram_migrate_lrc(struct xe_device *xe, struct xe_bo *pbo)
> +{
The locking is def not correct in this function but I don't think you
need this function. More below.
> + if (pbo->ttm.type == ttm_bo_type_kernel &&
> + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> + (pbo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
> + !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
> + unsigned long idx;
> + struct xe_exec_queue *q;
> + struct drm_device *dev = &xe->drm;
> + struct drm_file *file;
> + struct xe_lrc *lrc;
> +
> + /* TODO : Need to extend to multitile in future if needed */
> + mutex_lock(&dev->filelist_mutex);
> + list_for_each_entry(file, &dev->filelist, lhead) {
> + struct xe_file *xef = file->driver_priv;
> +
> + mutex_lock(&xef->exec_queue.lock);
> + xa_for_each(&xef->exec_queue.xa, idx, q) {
> + xe_exec_queue_get(q);
> + mutex_unlock(&xef->exec_queue.lock);
> +
> + for (int i = 0; i < q->width; i++) {
> + lrc = xe_exec_queue_get_lrc(q, i);
> + if (lrc->bo == pbo) {
> + xe_lrc_put(lrc);
> + mutex_lock(&xef->exec_queue.lock);
> + xe_exec_queue_put(q);
> + mutex_unlock(&xef->exec_queue.lock);
> + mutex_unlock(&dev->filelist_mutex);
> + return false;
> + }
> + xe_lrc_put(lrc);
> + }
> + mutex_lock(&xef->exec_queue.lock);
> + xe_exec_queue_put(q);
> + mutex_unlock(&xef->exec_queue.lock);
> + }
> + }
> + mutex_unlock(&dev->filelist_mutex);
> + return true;
> + }
> + return false;
> +}
> +
> +static void xe_ttm_vram_purge_page(struct xe_device *xe, struct xe_bo *pbo)
> +{
> + struct ttm_placement place = {};
> + struct ttm_operation_ctx ctx = {
> + .interruptible = false,
> + .gfp_retry_mayfail = false,
> + };
> + bool locked;
> + int ret = 0;
> +
> + /* Ban VM if BO is PPGTT */
> + if (pbo->ttm.type == ttm_bo_type_kernel &&
> + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> + pbo->flags & XE_BO_FLAG_PAGETABLE) {
> + down_write(&pbo->vm->lock);
> + xe_vm_kill(pbo->vm, true);
> + up_write(&pbo->vm->lock);
> + }
> +
> + /* Ban exec queue if BO is lrc */
> + if (pbo->ttm.type == ttm_bo_type_kernel &&
> + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> + (pbo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
> + !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
> + struct drm_device *dev = &xe->drm;
> + struct xe_exec_queue *q;
> + struct drm_file *file;
> + struct xe_lrc *lrc;
> + unsigned long idx;
> +
> + /* TODO : Need to extend to multitile in future if needed */
> + mutex_lock(&dev->filelist_mutex);
> + list_for_each_entry(file, &dev->filelist, lhead) {
> + struct xe_file *xef = file->driver_priv;
> +
> + mutex_lock(&xef->exec_queue.lock);
> + xa_for_each(&xef->exec_queue.xa, idx, q) {
> + xe_exec_queue_get(q);
> + mutex_unlock(&xef->exec_queue.lock);
> +
> + for (int i = 0; i < q->width; i++) {
> + lrc = xe_exec_queue_get_lrc(q, i);
> + if (lrc->bo == pbo) {
> + xe_lrc_put(lrc);
> + xe_exec_queue_kill(q);
> + } else {
> + xe_lrc_put(lrc);
> + }
> + }
> +
> + mutex_lock(&xef->exec_queue.lock);
> + xe_exec_queue_put(q);
> + mutex_unlock(&xef->exec_queue.lock);
> + }
> + }
> + mutex_unlock(&dev->filelist_mutex);
> + }
> +
> + spin_lock(&pbo->ttm.bdev->lru_lock);
> + locked = dma_resv_trylock(pbo->ttm.base.resv);
> + spin_unlock(&pbo->ttm.bdev->lru_lock);
> + WARN_ON(!locked);
> + ret = ttm_bo_validate(&pbo->ttm, &place, &ctx);
> + drm_WARN_ON(&xe->drm, ret);
> + xe_bo_put(pbo);
> + if (locked)
> + dma_resv_unlock(pbo->ttm.base.resv);
> +}
> +
> +static int xe_ttm_vram_reserve_page_at_addr(struct xe_device *xe, unsigned long addr,
> + struct xe_ttm_vram_mgr *vram_mgr, struct gpu_buddy *mm)
> +{
> + struct xe_ttm_vram_offline_resource *nentry;
> + struct ttm_buffer_object *tbo = NULL;
> + struct gpu_buddy_block *block;
> + struct gpu_buddy_block *b, *m;
> + enum reserve_status {
> + pending = 0,
> + fail
> + };
> + u64 size = SZ_4K;
> + int ret = 0;
> +
> + mutex_lock(&vram_mgr->lock);
> + block = gpu_buddy_addr_to_block(mm, addr);
> + if (PTR_ERR(block) == -ENXIO) {
> + mutex_unlock(&vram_mgr->lock);
> + return -ENXIO;
> + }
> +
> + nentry = kzalloc_obj(*nentry);
> + if (!nentry) {
> + mutex_unlock(&vram_mgr->lock);
> + return -ENOMEM;
> + }
> + INIT_LIST_HEAD(&nentry->blocks);
> + nentry->status = pending;
> +
> + if (block) {
> + struct xe_ttm_vram_offline_resource *pos, *n;
> + struct xe_bo *pbo;
> +
> + WARN_ON(!block->private);
> + tbo = block->private;
> + pbo = ttm_to_xe_bo(tbo);
> +
> + xe_bo_get(pbo);
> + /* Critical kernel BO? */
> + if (pbo->ttm.type == ttm_bo_type_kernel &&
> + (!(pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM) ||
Wouldn't it be easier to just add flag XE_BO_FLAG_KERNEL_CRITICAL then
update all BOs we create at driver with this flag?
We then can drop is_ttm_vram_migrate_lrc.
Matt
> + is_ttm_vram_migrate_lrc(xe, pbo))) {
> + mutex_unlock(&vram_mgr->lock);
> + kfree(nentry);
> + xe_ttm_vram_free_bad_pages(&xe->drm, vram_mgr);
> + xe_bo_put(pbo);
> + drm_err(&xe->drm,
> + "%s: corrupt addr: 0x%lx in critical kernel bo, request reset\n",
> + __func__, addr);
> + /* Hint System controller driver for reset with -EIO */
> + return -EIO;
> + }
> + nentry->id = ++vram_mgr->n_queued_pages;
> + list_add(&nentry->queued_link, &vram_mgr->queued_pages);
> + mutex_unlock(&vram_mgr->lock);
> +
> + /* Purge BO containing address */
> + xe_ttm_vram_purge_page(xe, pbo);
> +
> + /* Reserve page at address addr */
> + mutex_lock(&vram_mgr->lock);
> + ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
> + size, size, &nentry->blocks,
> + GPU_BUDDY_RANGE_ALLOCATION);
> +
> + if (ret) {
> + drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
> + addr, ret);
> + nentry->status = fail;
> + mutex_unlock(&vram_mgr->lock);
> + return ret;
> + }
> +
> + list_for_each_entry_safe(b, m, &nentry->blocks, link)
> + b->private = NULL;
> +
> + if ((addr + size) <= vram_mgr->visible_size) {
> + nentry->used_visible_size = size;
> + } else {
> + list_for_each_entry(b, &nentry->blocks, link) {
> + u64 start = gpu_buddy_block_offset(b);
> +
> + if (start < vram_mgr->visible_size) {
> + u64 end = start + gpu_buddy_block_size(mm, b);
> +
> + nentry->used_visible_size +=
> + min(end, vram_mgr->visible_size) - start;
> + }
> + }
> + }
> + vram_mgr->visible_avail -= nentry->used_visible_size;
> + list_for_each_entry_safe(pos, n, &vram_mgr->queued_pages, queued_link) {
> + if (pos->id == nentry->id) {
> + --vram_mgr->n_queued_pages;
> + list_del(&pos->queued_link);
> + break;
> + }
> + }
> + list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
> + /* TODO: FW Integration: Send command to FW for offlining page */
> + ++vram_mgr->n_offlined_pages;
> + mutex_unlock(&vram_mgr->lock);
> + return ret;
> +
> + } else {
> + ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
> + size, size, &nentry->blocks,
> + GPU_BUDDY_RANGE_ALLOCATION);
> + if (ret) {
> + drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
> + addr, ret);
> + nentry->status = fail;
> + mutex_unlock(&vram_mgr->lock);
> + return ret;
> + }
> +
> + list_for_each_entry_safe(b, m, &nentry->blocks, link)
> + b->private = NULL;
> +
> + if ((addr + size) <= vram_mgr->visible_size) {
> + nentry->used_visible_size = size;
> + } else {
> + struct gpu_buddy_block *block;
> +
> + list_for_each_entry(block, &nentry->blocks, link) {
> + u64 start = gpu_buddy_block_offset(block);
> +
> + if (start < vram_mgr->visible_size) {
> + u64 end = start + gpu_buddy_block_size(mm, block);
> +
> + nentry->used_visible_size +=
> + min(end, vram_mgr->visible_size) - start;
> + }
> + }
> + }
> + vram_mgr->visible_avail -= nentry->used_visible_size;
> + nentry->id = ++vram_mgr->n_offlined_pages;
> + list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
> + /* TODO: FW Integration: Send command to FW for offlining page */
> + mutex_unlock(&vram_mgr->lock);
> + }
> + /* Success */
> + return ret;
> +}
> +
> +static struct xe_vram_region *xe_ttm_vram_addr_to_region(struct xe_device *xe,
> + resource_size_t addr)
> +{
> + unsigned long stolen_base = xe_ttm_stolen_gpu_offset(xe);
> + struct xe_vram_region *vr;
> + struct xe_tile *tile;
> + int id;
> +
> + /* Addr from stolen memory? */
> + if (addr + SZ_4K >= stolen_base)
> + return NULL;
> +
> + for_each_tile(tile, xe, id) {
> + vr = tile->mem.vram;
> + if ((addr <= vr->dpa_base + vr->actual_physical_size) &&
> + (addr + SZ_4K >= vr->dpa_base))
> + return vr;
> + }
> + return NULL;
> +}
> +
> +/**
> + * xe_ttm_vram_handle_addr_fault - Handle flagged VRAM physical address error
> + * @xe: pointer to parent device
> + * @addr: physical faulty address
> + *
> + * Handle the physical faulty address error on the specific tile.
> + *
> + * Returns 0 for success, negative error code otherwise.
> + */
> +int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
> +{
> + struct xe_ttm_vram_mgr *vram_mgr;
> + struct xe_vram_region *vr;
> + struct gpu_buddy *mm;
> + int ret;
> +
> + vr = xe_ttm_vram_addr_to_region(xe, addr);
> + if (!vr) {
> + drm_err(&xe->drm, "%s:%d addr:%lx error requesting SBR\n",
> + __func__, __LINE__, addr);
> + /* Hint System controller driver for reset with -EIO */
> + return -EIO;
> + }
> + vram_mgr = &vr->ttm;
> + mm = &vram_mgr->mm;
> + /* Reserve page at address */
> + ret = xe_ttm_vram_reserve_page_at_addr(xe, addr, vram_mgr, mm);
> + return ret;
> +}
> +EXPORT_SYMBOL(xe_ttm_vram_handle_addr_fault);
> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> index 87b7fae5edba..8ef06d9d44f7 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> @@ -31,6 +31,7 @@ u64 xe_ttm_vram_get_cpu_visible_size(struct ttm_resource_manager *man);
> void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
> u64 *used, u64 *used_visible);
>
> +int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr);
> static inline struct xe_ttm_vram_mgr_resource *
> to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
> {
> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> index 9106da056b49..94eaf9d875f1 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> @@ -19,6 +19,14 @@ struct xe_ttm_vram_mgr {
> struct ttm_resource_manager manager;
> /** @mm: DRM buddy allocator which manages the VRAM */
> struct gpu_buddy mm;
> + /** @offlined_pages: List of offlined pages */
> + struct list_head offlined_pages;
> + /** @n_offlined_pages: Number of offlined pages */
> + u16 n_offlined_pages;
> + /** @queued_pages: List of queued pages */
> + struct list_head queued_pages;
> + /** @n_queued_pages: Number of queued pages */
> + u16 n_queued_pages;
> /** @visible_size: Proped size of the CPU visible portion */
> u64 visible_size;
> /** @visible_avail: CPU visible portion still unallocated */
> @@ -45,4 +53,22 @@ struct xe_ttm_vram_mgr_resource {
> unsigned long flags;
> };
>
> +/**
> + * struct xe_ttm_vram_offline_resource - Xe TTM VRAM offline resource
> + */
> +struct xe_ttm_vram_offline_resource {
> + /** @offlined_link: Link to offlined pages */
> + struct list_head offlined_link;
> + /** @queued_link: Link to queued pages */
> + struct list_head queued_link;
> + /** @blocks: list of DRM buddy blocks */
> + struct list_head blocks;
> + /** @used_visible_size: How many CPU visible bytes this resource is using */
> + u64 used_visible_size;
> + /** @id: The id of an offline resource */
> + u16 id;
> + /** @status: reservation status of resource */
> + bool status;
> +};
> +
> #endif
> --
> 2.52.0
>
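The `used_visible_size` accounting in the patch above clamps each reserved block against the CPU-visible window so only visible bytes are subtracted from `visible_avail`. A minimal sketch of that clamp (hypothetical helper names; the kernel code uses the `min()` macro on `u64` values):

```c
#include <assert.h>

static unsigned long long min_u64(unsigned long long a, unsigned long long b)
{
	return a < b ? a : b;
}

/* How many bytes of [start, start + size) fall inside the CPU-visible
 * window [0, visible_size)? Mirrors the per-block loop in the patch. */
static unsigned long long visible_bytes(unsigned long long start,
					unsigned long long size,
					unsigned long long visible_size)
{
	unsigned long long end = start + size;

	if (start >= visible_size)
		return 0;
	return min_u64(end, visible_size) - start;
}
```

Note the fast path in the patch: when `addr + size <= visible_size`, the whole reserved page is visible and the loop is skipped entirely.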
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy
2026-03-27 11:48 ` [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy Tejas Upadhyay
@ 2026-04-01 23:56 ` Matthew Brost
2026-04-02 9:10 ` Upadhyay, Tejas
0 siblings, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-04-01 23:56 UTC (permalink / raw)
To: Tejas Upadhyay
Cc: intel-xe, matthew.auld, thomas.hellstrom, himal.prasad.ghimiray
On Fri, Mar 27, 2026 at 05:18:14PM +0530, Tejas Upadhyay wrote:
> Setup to link TTM buffer object inside gpu buddy. This functionality
> is critical for supporting the memory page offline feature on CRI,
> where identified faulty pages must be traced back to their
> originating buffer for safe removal.
>
I just checked the SVM code is still setting 'block->private'.
I thought we had patch for that but I don't see it in this series.
Matt
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> ---
> drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> index 5fd0d5506a7e..c627dbf94552 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> @@ -54,6 +54,7 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
> struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
> struct xe_ttm_vram_mgr_resource *vres;
> struct gpu_buddy *mm = &mgr->mm;
> + struct gpu_buddy_block *block;
> u64 size, min_page_size;
> unsigned long lpfn;
> int err;
> @@ -138,6 +139,8 @@ static int xe_ttm_vram_mgr_new(struct ttm_resource_manager *man,
> }
>
> mgr->visible_avail -= vres->used_visible_size;
> + list_for_each_entry(block, &vres->blocks, link)
> + block->private = tbo;
> mutex_unlock(&mgr->lock);
>
> if (!(vres->base.placement & TTM_PL_FLAG_CONTIGUOUS) &&
> --
> 2.52.0
>
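The loop added above gives every buddy block of an allocation a back-pointer to its owning buffer object, which is what later lets a faulty address be traced to its owner. A minimal userspace sketch of the idea, with a hand-rolled singly linked list standing in for `list_head` (all names hypothetical):

```c
#include <assert.h>
#include <stddef.h>

struct tbo { int id; };

struct blk {
	struct blk *next;
	void *private; /* owning buffer object, as in block->private */
};

/* After allocation, record the owner in every block of the result list. */
static void link_blocks_to_bo(struct blk *head, struct tbo *tbo)
{
	for (struct blk *b = head; b; b = b->next)
		b->private = tbo;
}
```

The corresponding teardown in the series resets `block->private` to NULL once the BO is purged, so a stale pointer is never dereferenced.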
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH V6 2/7] drm/gpu: Add gpu_buddy_addr_to_block helper
2026-03-27 11:48 ` [RFC PATCH V6 2/7] drm/gpu: Add gpu_buddy_addr_to_block helper Tejas Upadhyay
@ 2026-04-02 0:09 ` Matthew Brost
2026-04-02 10:16 ` Matthew Auld
2026-04-02 9:12 ` Matthew Auld
1 sibling, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-04-02 0:09 UTC (permalink / raw)
To: Tejas Upadhyay
Cc: intel-xe, matthew.auld, thomas.hellstrom, himal.prasad.ghimiray
On Fri, Mar 27, 2026 at 05:18:15PM +0530, Tejas Upadhyay wrote:
> Add helper with primary purpose is to efficiently trace a specific
> physical memory address back to its corresponding TTM buffer object.
>
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> ---
> drivers/gpu/buddy.c | 56 +++++++++++++++++++++++++++++++++++++++
> include/linux/gpu_buddy.h | 2 ++
> 2 files changed, 58 insertions(+)
>
> diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c
> index 52686672e99f..2d26c2a0f971 100644
> --- a/drivers/gpu/buddy.c
> +++ b/drivers/gpu/buddy.c
> @@ -589,6 +589,62 @@ void gpu_buddy_free_block(struct gpu_buddy *mm,
> }
> EXPORT_SYMBOL(gpu_buddy_free_block);
>
> +/**
> + * gpu_buddy_addr_to_block - given physical address find a block
> + *
> + * @mm: GPU buddy manager
> + * @addr: Physical address
> + *
> + * Returns:
> + * gpu_buddy_block on success, NULL or error code on failure
> + */
> +struct gpu_buddy_block *gpu_buddy_addr_to_block(struct gpu_buddy *mm, u64 addr)
> +{
> + struct gpu_buddy_block *block;
> + LIST_HEAD(dfs);
> + u64 end;
> + int i;
> +
This is somewhat of an orthogonal issue, but we really should update
gpu_buddy so drivers can register a lock that we can assert is held in
functions where gpu_buddy manipulates blocks.
Something like this [1].
I’ll defer to the gpu_buddy maintainers on whether this code is correct.
Matt
[1] https://elixir.bootlin.com/linux/v6.19.10/source/include/drm/drm_gpusvm.h#L341
> + end = addr + SZ_4K - 1;
> + for (i = 0; i < mm->n_roots; ++i)
> + list_add_tail(&mm->roots[i]->tmp_link, &dfs);
> +
> + do {
> + u64 block_start;
> + u64 block_end;
> +
> + block = list_first_entry_or_null(&dfs,
> + struct gpu_buddy_block,
> + tmp_link);
> + if (!block)
> + break;
> +
> + list_del(&block->tmp_link);
> +
> + block_start = gpu_buddy_block_offset(block);
> + block_end = block_start + gpu_buddy_block_size(mm, block) - 1;
> +
> + if (!overlaps(addr, end, block_start, block_end))
> + continue;
> +
> + if (contains(addr, end, block_start, block_end) &&
> + !gpu_buddy_block_is_split(block)) {
> + if (gpu_buddy_block_is_free(block))
> + return NULL;
> + else if (gpu_buddy_block_is_allocated(block) && !mm->clear_avail)
> + return block;
> + }
> +
> + if (gpu_buddy_block_is_split(block)) {
> + list_add(&block->right->tmp_link, &dfs);
> + list_add(&block->left->tmp_link, &dfs);
> + }
> + } while (1);
> +
> + return ERR_PTR(-ENXIO);
> +}
> +EXPORT_SYMBOL(gpu_buddy_addr_to_block);
> +
> static void __gpu_buddy_free_list(struct gpu_buddy *mm,
> struct list_head *objects,
> bool mark_clear,
> diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
> index 5fa917ba5450..957c69c560bc 100644
> --- a/include/linux/gpu_buddy.h
> +++ b/include/linux/gpu_buddy.h
> @@ -231,6 +231,8 @@ void gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear);
>
> void gpu_buddy_free_block(struct gpu_buddy *mm, struct gpu_buddy_block *block);
>
> +struct gpu_buddy_block *gpu_buddy_addr_to_block(struct gpu_buddy *mm, u64 addr);
> +
> void gpu_buddy_free_list(struct gpu_buddy *mm,
> struct list_head *objects,
> unsigned int flags);
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH V6 3/7] drm/xe: Handle physical memory address error
2026-03-27 11:48 ` [RFC PATCH V6 3/7] drm/xe: Handle physical memory address error Tejas Upadhyay
2026-04-01 23:53 ` Matthew Brost
@ 2026-04-02 1:03 ` Matthew Brost
2026-04-02 10:30 ` Upadhyay, Tejas
1 sibling, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-04-02 1:03 UTC (permalink / raw)
To: Tejas Upadhyay
Cc: intel-xe, matthew.auld, thomas.hellstrom, himal.prasad.ghimiray
On Fri, Mar 27, 2026 at 05:18:16PM +0530, Tejas Upadhyay wrote:
> This functionality represents a significant step in making
> the xe driver gracefully handle hardware memory degradation.
> By integrating with the DRM Buddy allocator, the driver
> can permanently "carve out" faulty memory so it isn't reused
> by subsequent allocations.
>
> Buddy Block Reservation:
> ----------------------
> When a memory address is reported as faulty, the driver instructs
> the DRM Buddy allocator to reserve a block of the specific page
> size (typically 4KB). This marks the memory as "dirty/used"
> indefinitely.
>
> Two-Stage Tracking:
> -----------------
> Offlined Pages:
> Pages that have been successfully isolated and removed from the
> available memory pool.
>
> Queued Pages:
> Addresses that have been flagged as faulty but are currently in
> use by a process. These are tracked until the associated buffer
> object (BO) is released or migrated, at which point they move
> to the "offlined" state.
>
> Sysfs Reporting:
> --------------
> The patch exposes these metrics through a standard interface,
> allowing administrators to monitor VRAM health:
> /sys/bus/pci/devices/<device_id>/vram_bad_pages
>
> V5:
> - Categorise and handle BOs accordingly
> - Fix crash found with new debugfs tests
> V4:
> - Set block->private NULL post bo purge
> - Filter out gsm address early on
> - Rebase
> V3:
> - Rename API, remove tile dependency and add status of reservation
> V2:
> - Fix mm->avail counter issue
> - Remove unused code and handle clean up in case of error
>
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> ---
> drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 336 +++++++++++++++++++++
> drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 1 +
> drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 26 ++
> 3 files changed, 363 insertions(+)
>
> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> index c627dbf94552..0fec7b332501 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> @@ -13,7 +13,10 @@
>
> #include "xe_bo.h"
> #include "xe_device.h"
> +#include "xe_exec_queue.h"
> +#include "xe_lrc.h"
> #include "xe_res_cursor.h"
> +#include "xe_ttm_stolen_mgr.h"
> #include "xe_ttm_vram_mgr.h"
> #include "xe_vram_types.h"
>
> @@ -277,6 +280,26 @@ static const struct ttm_resource_manager_func xe_ttm_vram_mgr_func = {
> .debug = xe_ttm_vram_mgr_debug
> };
>
> +static void xe_ttm_vram_free_bad_pages(struct drm_device *dev, struct xe_ttm_vram_mgr *mgr)
> +{
> + struct xe_ttm_vram_offline_resource *pos, *n;
> +
> + mutex_lock(&mgr->lock);
> + list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
> + --mgr->n_offlined_pages;
> + gpu_buddy_free_list(&mgr->mm, &pos->blocks, 0);
> + mgr->visible_avail += pos->used_visible_size;
> + list_del(&pos->offlined_link);
> + kfree(pos);
> + }
> + list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
> + list_del(&pos->queued_link);
> + mgr->n_queued_pages--;
> + kfree(pos);
> + }
> + mutex_unlock(&mgr->lock);
> +}
> +
> static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
> {
> struct xe_device *xe = to_xe_device(dev);
> @@ -288,6 +311,8 @@ static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
> if (ttm_resource_manager_evict_all(&xe->ttm, man))
> return;
>
> + xe_ttm_vram_free_bad_pages(dev, mgr);
> +
> WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
>
> gpu_buddy_fini(&mgr->mm);
> @@ -316,6 +341,8 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
> man->func = &xe_ttm_vram_mgr_func;
> mgr->mem_type = mem_type;
> mutex_init(&mgr->lock);
> + INIT_LIST_HEAD(&mgr->offlined_pages);
> + INIT_LIST_HEAD(&mgr->queued_pages);
> mgr->default_page_size = default_page_size;
> mgr->visible_size = io_size;
> mgr->visible_avail = io_size;
> @@ -471,3 +498,312 @@ u64 xe_ttm_vram_get_avail(struct ttm_resource_manager *man)
>
> return avail;
> }
> +
> +static bool is_ttm_vram_migrate_lrc(struct xe_device *xe, struct xe_bo *pbo)
As discussed in a prior reply [1], I think this can be dropped.
[1] https://patchwork.freedesktop.org/patch/714756/?series=161473&rev=6#comment_1318048
> +{
> + if (pbo->ttm.type == ttm_bo_type_kernel &&
> + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> + (pbo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
> + !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
> + unsigned long idx;
> + struct xe_exec_queue *q;
> + struct drm_device *dev = &xe->drm;
> + struct drm_file *file;
> + struct xe_lrc *lrc;
> +
> + /* TODO : Need to extend to multitile in future if needed */
> + mutex_lock(&dev->filelist_mutex);
> + list_for_each_entry(file, &dev->filelist, lhead) {
> + struct xe_file *xef = file->driver_priv;
> +
> + mutex_lock(&xef->exec_queue.lock);
> + xa_for_each(&xef->exec_queue.xa, idx, q) {
> + xe_exec_queue_get(q);
> + mutex_unlock(&xef->exec_queue.lock);
> +
> + for (int i = 0; i < q->width; i++) {
> + lrc = xe_exec_queue_get_lrc(q, i);
> + if (lrc->bo == pbo) {
> + xe_lrc_put(lrc);
> + mutex_lock(&xef->exec_queue.lock);
> + xe_exec_queue_put(q);
> + mutex_unlock(&xef->exec_queue.lock);
> + mutex_unlock(&dev->filelist_mutex);
> + return false;
> + }
> + xe_lrc_put(lrc);
> + }
> + mutex_lock(&xef->exec_queue.lock);
> + xe_exec_queue_put(q);
> + mutex_unlock(&xef->exec_queue.lock);
> + }
> + }
> + mutex_unlock(&dev->filelist_mutex);
> + return true;
> + }
> + return false;
> +}
> +
> +static void xe_ttm_vram_purge_page(struct xe_device *xe, struct xe_bo *pbo)
> +{
> + struct ttm_placement place = {};
> + struct ttm_operation_ctx ctx = {
> + .interruptible = false,
> + .gfp_retry_mayfail = false,
> + };
> + bool locked;
> + int ret = 0;
> +
> + /* Ban VM if BO is PPGTT */
> + if (pbo->ttm.type == ttm_bo_type_kernel &&
> + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> + pbo->flags & XE_BO_FLAG_PAGETABLE) {
I think XE_BO_FLAG_PAGETABLE and XE_BO_FLAG_FORCE_USER_VRAM are
sufficient here.
Also, if XE_BO_FLAG_PAGETABLE is set but XE_BO_FLAG_FORCE_USER_VRAM is
clear, that means this is a kernel VM and we probably have to wedge the
device, right?
> + down_write(&pbo->vm->lock);
> + xe_vm_kill(pbo->vm, true);
> + up_write(&pbo->vm->lock);
> + }
> +
> + /* Ban exec queue if BO is lrc */
> + if (pbo->ttm.type == ttm_bo_type_kernel &&
> + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> + (pbo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
> + !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
This is a huge if statement just to determine whether this is an LRC. At
a minimum, we’d need to normalize this, and it looks very fragile—if we
change flags elsewhere in the driver, this if statement could easily
break.
Also, I can’t say I’m a fan of searching just to kill an individual
queue.
It’s a bit unfortunate that LRCs are created without a VM (I forget the
exact reasoning, but I seem to recall it was related to multi-q?)
I think what we really want to do is:
- If we find a PT or LRC BO, kill the VM.
- Update ‘kill VM’ to kill all exec queues. I honestly forget why we
only kill preempt/rebind queues—it’s likely some nonsensical reasoning
that we never cleaned up. We already have xe_vm_add_exec_queue(), which
is short-circuited on xe->info.has_ctx_tlb_inval, but we can just
remove that.
- Normalize this with an LRC BO flag and store the user_vm in the BO for
LRCs.
- Critical kernel BOs normalized with BO flag -> wedge the device
The difference between killing a queue and killing a VM doesn’t really
matter from a user-space point of view, since typically a single-queue
hang leads to the entire process crashing or restarting—at least for
Mesa 3D. We should confirm with compute whether this is also what we’re
targeting for CRI, but I suspect the answer is the same. Even if it
isn’t, I’m not convinced per-queue killing is worthwhile. And if we
decide it is, the filelist / exec_queue.xa search is pretty much a
non-starter for me—for example, we’d need to make this much simpler and
avoid taking a bunch of locks here, which looks pretty scary.
> + struct drm_device *dev = &xe->drm;
> + struct xe_exec_queue *q;
> + struct drm_file *file;
> + struct xe_lrc *lrc;
> + unsigned long idx;
> +
> + /* TODO : Need to extend to multitile in future if needed */
> + mutex_lock(&dev->filelist_mutex);
> + list_for_each_entry(file, &dev->filelist, lhead) {
> + struct xe_file *xef = file->driver_priv;
> +
> + mutex_lock(&xef->exec_queue.lock);
> + xa_for_each(&xef->exec_queue.xa, idx, q) {
> + xe_exec_queue_get(q);
> + mutex_unlock(&xef->exec_queue.lock);
> +
> + for (int i = 0; i < q->width; i++) {
> + lrc = xe_exec_queue_get_lrc(q, i);
> + if (lrc->bo == pbo) {
> + xe_lrc_put(lrc);
> + xe_exec_queue_kill(q);
> + } else {
> + xe_lrc_put(lrc);
> + }
> + }
> +
> + mutex_lock(&xef->exec_queue.lock);
> + xe_exec_queue_put(q);
> + mutex_unlock(&xef->exec_queue.lock);
> + }
> + }
> + mutex_unlock(&dev->filelist_mutex);
> + }
> +
> + spin_lock(&pbo->ttm.bdev->lru_lock);
> + locked = dma_resv_trylock(pbo->ttm.base.resv);
> + spin_unlock(&pbo->ttm.bdev->lru_lock);
> + WARN_ON(!locked);
Is there any reason why we can’t just take a sleeping dma_resv_lock
here (e.g. xe_bo_lock)? Also, I think the trick with the LRU lock only
works once the BO’s dma_resv has been individualized (kref == 0), which
is clearly not the case here.
> + ret = ttm_bo_validate(&pbo->ttm, &place, &ctx);
> + drm_WARN_ON(&xe->drm, ret);
> + xe_bo_put(pbo);
> + if (locked)
> + dma_resv_unlock(pbo->ttm.base.resv);
> +}
> +
> +static int xe_ttm_vram_reserve_page_at_addr(struct xe_device *xe, unsigned long addr,
> + struct xe_ttm_vram_mgr *vram_mgr, struct gpu_buddy *mm)
> +{
> + struct xe_ttm_vram_offline_resource *nentry;
> + struct ttm_buffer_object *tbo = NULL;
> + struct gpu_buddy_block *block;
> + struct gpu_buddy_block *b, *m;
> + enum reserve_status {
> + pending = 0,
> + fail
> + };
> + u64 size = SZ_4K;
> + int ret = 0;
> +
> + mutex_lock(&vram_mgr->lock);
You’re going to have to fix the locking here. For example, the lock is
released inside nested if statements below, which makes this function
very difficult to follow. Personally, I can’t really focus on anything
else until this is cleaned up. I’m not saying we don’t already have bad
locking patterns in Xe—I’m sure we do—but let’s avoid introducing new
code with those patterns.
For example, it should look more like this:
mutex_lock(&vram_mgr->lock);
/* Do the minimal work that requires the lock */
mutex_unlock(&vram_mgr->lock);
/* Do other work where &vram_mgr->lock needs to be dropped */
mutex_lock(&vram_mgr->lock);
/* Do more work that requires the lock */
mutex_unlock(&vram_mgr->lock);
Also strongly prefer guards or scoped_guards too.
> + block = gpu_buddy_addr_to_block(mm, addr);
> + if (PTR_ERR(block) == -ENXIO) {
> + mutex_unlock(&vram_mgr->lock);
> + return -ENXIO;
> + }
> +
> + nentry = kzalloc_obj(*nentry);
> + if (!nentry)
> + return -ENOMEM;
> + INIT_LIST_HEAD(&nentry->blocks);
> + nentry->status = pending;
> +
> + if (block) {
> + struct xe_ttm_vram_offline_resource *pos, *n;
> + struct xe_bo *pbo;
> +
> + WARN_ON(!block->private);
> + tbo = block->private;
> + pbo = ttm_to_xe_bo(tbo);
> +
> + xe_bo_get(pbo);
This probably needs a kref get if it’s non‑zero. If this is a zombie BO,
it should already be getting destroyed. Also, we’re going to need to
look into gutting the TTM pipeline as well, where TTM resources are
transferred to different BOs—but there’s enough to clean up here first
before we get to that.
I'm going to stop here as there is quite a bit to clean up / simplify
before I can dig in more.
Matt
> + /* Critical kernel BO? */
> + if (pbo->ttm.type == ttm_bo_type_kernel &&
> + (!(pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM) ||
> + is_ttm_vram_migrate_lrc(xe, pbo))) {
> + mutex_unlock(&vram_mgr->lock);
> + kfree(nentry);
> + xe_ttm_vram_free_bad_pages(&xe->drm, vram_mgr);
> + xe_bo_put(pbo);
> + drm_err(&xe->drm,
> + "%s: corrupt addr: 0x%lx in critical kernel bo, request reset\n",
> + __func__, addr);
> + /* Hint System controller driver for reset with -EIO */
> + return -EIO;
> + }
> + nentry->id = ++vram_mgr->n_queued_pages;
> + list_add(&nentry->queued_link, &vram_mgr->queued_pages);
> + mutex_unlock(&vram_mgr->lock);
> +
> + /* Purge BO containing address */
> + xe_ttm_vram_purge_page(xe, pbo);
> +
> + /* Reserve page at address addr */
> + mutex_lock(&vram_mgr->lock);
> + ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
> + size, size, &nentry->blocks,
> + GPU_BUDDY_RANGE_ALLOCATION);
> +
> + if (ret) {
> + drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
> + addr, ret);
> + nentry->status = fail;
> + mutex_unlock(&vram_mgr->lock);
> + return ret;
> + }
> +
> + list_for_each_entry_safe(b, m, &nentry->blocks, link)
> + b->private = NULL;
> +
> + if ((addr + size) <= vram_mgr->visible_size) {
> + nentry->used_visible_size = size;
> + } else {
> + list_for_each_entry(b, &nentry->blocks, link) {
> + u64 start = gpu_buddy_block_offset(b);
> +
> + if (start < vram_mgr->visible_size) {
> + u64 end = start + gpu_buddy_block_size(mm, b);
> +
> + nentry->used_visible_size +=
> + min(end, vram_mgr->visible_size) - start;
> + }
> + }
> + }
> + vram_mgr->visible_avail -= nentry->used_visible_size;
> + list_for_each_entry_safe(pos, n, &vram_mgr->queued_pages, queued_link) {
> + if (pos->id == nentry->id) {
> + --vram_mgr->n_queued_pages;
> + list_del(&pos->queued_link);
> + break;
> + }
> + }
> + list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
> + /* TODO: FW Integration: Send command to FW for offlining page */
> + ++vram_mgr->n_offlined_pages;
> + mutex_unlock(&vram_mgr->lock);
> + return ret;
> +
> + } else {
> + ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
> + size, size, &nentry->blocks,
> + GPU_BUDDY_RANGE_ALLOCATION);
> + if (ret) {
> + drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
> + addr, ret);
> + nentry->status = fail;
> + mutex_unlock(&vram_mgr->lock);
> + return ret;
> + }
> +
> + list_for_each_entry_safe(b, m, &nentry->blocks, link)
> + b->private = NULL;
> +
> + if ((addr + size) <= vram_mgr->visible_size) {
> + nentry->used_visible_size = size;
> + } else {
> + struct gpu_buddy_block *block;
> +
> + list_for_each_entry(block, &nentry->blocks, link) {
> + u64 start = gpu_buddy_block_offset(block);
> +
> + if (start < vram_mgr->visible_size) {
> + u64 end = start + gpu_buddy_block_size(mm, block);
> +
> + nentry->used_visible_size +=
> + min(end, vram_mgr->visible_size) - start;
> + }
> + }
> + }
> + vram_mgr->visible_avail -= nentry->used_visible_size;
> + nentry->id = ++vram_mgr->n_offlined_pages;
> + list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
> + /* TODO: FW Integration: Send command to FW for offlining page */
> + mutex_unlock(&vram_mgr->lock);
> + }
> + /* Success */
> + return ret;
> +}
> +
> +static struct xe_vram_region *xe_ttm_vram_addr_to_region(struct xe_device *xe,
> + resource_size_t addr)
> +{
> + unsigned long stolen_base = xe_ttm_stolen_gpu_offset(xe);
> + struct xe_vram_region *vr;
> + struct xe_tile *tile;
> + int id;
> +
> + /* Addr from stolen memory? */
> + if (addr + SZ_4K >= stolen_base)
> + return NULL;
> +
> + for_each_tile(tile, xe, id) {
> + vr = tile->mem.vram;
> + if ((addr <= vr->dpa_base + vr->actual_physical_size) &&
> + (addr + SZ_4K >= vr->dpa_base))
> + return vr;
> + }
> + return NULL;
> +}
> +
> +/**
> + * xe_ttm_vram_handle_addr_fault - Handle vram physical address error flagged
> + * @xe: pointer to parent device
> + * @addr: physical faulty address
> + *
> + * Handle the physical faulty address error on a specific tile.
> + *
> + * Returns 0 for success, negative error code otherwise.
> + */
> +int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
> +{
> + struct xe_ttm_vram_mgr *vram_mgr;
> + struct xe_vram_region *vr;
> + struct gpu_buddy *mm;
> + int ret;
> +
> + vr = xe_ttm_vram_addr_to_region(xe, addr);
> + if (!vr) {
> + drm_err(&xe->drm, "%s:%d addr:%lx error requesting SBR\n",
> + __func__, __LINE__, addr);
> + /* Hint System controller driver for reset with -EIO */
> + return -EIO;
> + }
> + vram_mgr = &vr->ttm;
> + mm = &vram_mgr->mm;
> + /* Reserve page at address */
> + ret = xe_ttm_vram_reserve_page_at_addr(xe, addr, vram_mgr, mm);
> + return ret;
> +}
> +EXPORT_SYMBOL(xe_ttm_vram_handle_addr_fault);
> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> index 87b7fae5edba..8ef06d9d44f7 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> @@ -31,6 +31,7 @@ u64 xe_ttm_vram_get_cpu_visible_size(struct ttm_resource_manager *man);
> void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
> u64 *used, u64 *used_visible);
>
> +int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr);
> static inline struct xe_ttm_vram_mgr_resource *
> to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
> {
> diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> index 9106da056b49..94eaf9d875f1 100644
> --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> @@ -19,6 +19,14 @@ struct xe_ttm_vram_mgr {
> struct ttm_resource_manager manager;
> /** @mm: DRM buddy allocator which manages the VRAM */
> struct gpu_buddy mm;
> + /** @offlined_pages: List of offlined pages */
> + struct list_head offlined_pages;
> + /** @n_offlined_pages: Number of offlined pages */
> + u16 n_offlined_pages;
> + /** @queued_pages: List of queued pages */
> + struct list_head queued_pages;
> + /** @n_queued_pages: Number of queued pages */
> + u16 n_queued_pages;
> /** @visible_size: Proped size of the CPU visible portion */
> u64 visible_size;
> /** @visible_avail: CPU visible portion still unallocated */
> @@ -45,4 +53,22 @@ struct xe_ttm_vram_mgr_resource {
> unsigned long flags;
> };
>
> +/**
> + * struct xe_ttm_vram_offline_resource - Xe TTM VRAM offline resource
> + */
> +struct xe_ttm_vram_offline_resource {
> + /** @offlined_link: Link to offlined pages */
> + struct list_head offlined_link;
> + /** @queued_link: Link to queued pages */
> + struct list_head queued_link;
> + /** @blocks: list of DRM buddy blocks */
> + struct list_head blocks;
> + /** @used_visible_size: How many CPU visible bytes this resource is using */
> + u64 used_visible_size;
> + /** @id: The id of an offline resource */
> + u16 id;
> + /** @status: reservation status of resource */
> + bool status;
> +};
> +
> #endif
> --
> 2.52.0
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy
2026-04-01 23:56 ` Matthew Brost
@ 2026-04-02 9:10 ` Upadhyay, Tejas
2026-04-02 20:50 ` Matthew Brost
0 siblings, 1 reply; 24+ messages in thread
From: Upadhyay, Tejas @ 2026-04-02 9:10 UTC (permalink / raw)
To: Brost, Matthew
Cc: intel-xe@lists.freedesktop.org, Auld, Matthew,
thomas.hellstrom@linux.intel.com, Ghimiray, Himal Prasad
> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: 02 April 2026 05:27
> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Auld, Matthew
> <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com; Ghimiray,
> Himal Prasad <himal.prasad.ghimiray@intel.com>
> Subject: Re: [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy
>
> On Fri, Mar 27, 2026 at 05:18:14PM +0530, Tejas Upadhyay wrote:
> > Setup to link TTM buffer object inside gpu buddy. This functionality
> > is critical for supporting the memory page offline feature on CRI,
> > where identified faulty pages must be traced back to their originating
> > buffer for safe removal.
> >
>
> I just checked the SVM code is still setting 'block->private'.
>
> I thought we had patch for that but I don't see it in this series.
Yeah, you are right. It was there when I sent an earlier revision, but I somehow missed adding that patch to this series. I will add it back in the next revision. Sorry for the confusion here.
Tejas
>
> Matt
>
> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > index 5fd0d5506a7e..c627dbf94552 100644
> > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > @@ -54,6 +54,7 @@ static int xe_ttm_vram_mgr_new(struct
> ttm_resource_manager *man,
> > struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
> > struct xe_ttm_vram_mgr_resource *vres;
> > struct gpu_buddy *mm = &mgr->mm;
> > + struct gpu_buddy_block *block;
> > u64 size, min_page_size;
> > unsigned long lpfn;
> > int err;
> > @@ -138,6 +139,8 @@ static int xe_ttm_vram_mgr_new(struct
> ttm_resource_manager *man,
> > }
> >
> > mgr->visible_avail -= vres->used_visible_size;
> > + list_for_each_entry(block, &vres->blocks, link)
> > + block->private = tbo;
> > mutex_unlock(&mgr->lock);
> >
> > if (!(vres->base.placement & TTM_PL_FLAG_CONTIGUOUS) &&
> > --
> > 2.52.0
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH V6 2/7] drm/gpu: Add gpu_buddy_addr_to_block helper
2026-03-27 11:48 ` [RFC PATCH V6 2/7] drm/gpu: Add gpu_buddy_addr_to_block helper Tejas Upadhyay
2026-04-02 0:09 ` Matthew Brost
@ 2026-04-02 9:12 ` Matthew Auld
1 sibling, 0 replies; 24+ messages in thread
From: Matthew Auld @ 2026-04-02 9:12 UTC (permalink / raw)
To: Tejas Upadhyay, intel-xe
Cc: matthew.brost, thomas.hellstrom, himal.prasad.ghimiray
On 27/03/2026 11:48, Tejas Upadhyay wrote:
> Add a helper whose primary purpose is to efficiently trace a specific
> physical memory address back to its corresponding TTM buffer object.
>
> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> ---
> drivers/gpu/buddy.c | 56 +++++++++++++++++++++++++++++++++++++++
> include/linux/gpu_buddy.h | 2 ++
> 2 files changed, 58 insertions(+)
>
> diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c
> index 52686672e99f..2d26c2a0f971 100644
> --- a/drivers/gpu/buddy.c
> +++ b/drivers/gpu/buddy.c
> @@ -589,6 +589,62 @@ void gpu_buddy_free_block(struct gpu_buddy *mm,
> }
> EXPORT_SYMBOL(gpu_buddy_free_block);
>
> +/**
> + * gpu_buddy_addr_to_block - given physical address find a block
> + *
> + * @mm: GPU buddy manager
> + * @addr: Physical address
Maybe mention that this is a relative addr, and not necessarily the real
physical address?
> + *
> + * Returns:
> + * gpu_buddy_block on success, NULL or error code on failure
> + */
> +struct gpu_buddy_block *gpu_buddy_addr_to_block(struct gpu_buddy *mm, u64 addr)
> +{
> + struct gpu_buddy_block *block;
> + LIST_HEAD(dfs);
> + u64 end;
> + int i;
> +
> + end = addr + SZ_4K - 1;
> + for (i = 0; i < mm->n_roots; ++i)
> + list_add_tail(&mm->roots[i]->tmp_link, &dfs);
> +
> + do {
> + u64 block_start;
> + u64 block_end;
> +
> + block = list_first_entry_or_null(&dfs,
> + struct gpu_buddy_block,
> + tmp_link);
> + if (!block)
> + break;
> +
> + list_del(&block->tmp_link);
> +
> + block_start = gpu_buddy_block_offset(block);
> + block_end = block_start + gpu_buddy_block_size(mm, block) - 1;
> +
> + if (!overlaps(addr, end, block_start, block_end))
> + continue;
> +
> + if (contains(addr, end, block_start, block_end) &&
> + !gpu_buddy_block_is_split(block)) {
> + if (gpu_buddy_block_is_free(block))
> + return NULL;
> + else if (gpu_buddy_block_is_allocated(block) && !mm->clear_avail)
I think we need to drop the clear_avail check here, or this won't work
correctly with clear/dirty tracking (currently xe doesn't use it).
> + return block;
Should the function doc be updated? This will only find allocated
blocks, but the doc doesn't make that clear. Maybe also
gpu_buddy_allocated_addr_to_block() to make this even clearer? Or maybe
gpu_buddy_allocated_addr_to_priv(), if we don't actually care about the
block itself?
The other thing we need is a small testcase for this new interface in
the gpu_buddy selftest, but maybe it makes sense to hold off on that
until we get input from Arun etc. in case they have feedback on the
interface itself.
> + }
> +
> + if (gpu_buddy_block_is_split(block)) {
> + list_add(&block->right->tmp_link, &dfs);
> + list_add(&block->left->tmp_link, &dfs);
> + }
> + } while (1);
> +
> + return ERR_PTR(-ENXIO);
> +}
> +EXPORT_SYMBOL(gpu_buddy_addr_to_block);
> +
> static void __gpu_buddy_free_list(struct gpu_buddy *mm,
> struct list_head *objects,
> bool mark_clear,
> diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
> index 5fa917ba5450..957c69c560bc 100644
> --- a/include/linux/gpu_buddy.h
> +++ b/include/linux/gpu_buddy.h
> @@ -231,6 +231,8 @@ void gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear);
>
> void gpu_buddy_free_block(struct gpu_buddy *mm, struct gpu_buddy_block *block);
>
> +struct gpu_buddy_block *gpu_buddy_addr_to_block(struct gpu_buddy *mm, u64 addr);
> +
> void gpu_buddy_free_list(struct gpu_buddy *mm,
> struct list_head *objects,
> unsigned int flags);
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH V6 2/7] drm/gpu: Add gpu_buddy_addr_to_block helper
2026-04-02 0:09 ` Matthew Brost
@ 2026-04-02 10:16 ` Matthew Auld
0 siblings, 0 replies; 24+ messages in thread
From: Matthew Auld @ 2026-04-02 10:16 UTC (permalink / raw)
To: Matthew Brost, Tejas Upadhyay
Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray
On 02/04/2026 01:09, Matthew Brost wrote:
> On Fri, Mar 27, 2026 at 05:18:15PM +0530, Tejas Upadhyay wrote:
>> Add a helper whose primary purpose is to efficiently trace a specific
>> physical memory address back to its corresponding TTM buffer object.
>>
>> Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
>> ---
>> drivers/gpu/buddy.c | 56 +++++++++++++++++++++++++++++++++++++++
>> include/linux/gpu_buddy.h | 2 ++
>> 2 files changed, 58 insertions(+)
>>
>> diff --git a/drivers/gpu/buddy.c b/drivers/gpu/buddy.c
>> index 52686672e99f..2d26c2a0f971 100644
>> --- a/drivers/gpu/buddy.c
>> +++ b/drivers/gpu/buddy.c
>> @@ -589,6 +589,62 @@ void gpu_buddy_free_block(struct gpu_buddy *mm,
>> }
>> EXPORT_SYMBOL(gpu_buddy_free_block);
>>
>> +/**
>> + * gpu_buddy_addr_to_block - given physical address find a block
>> + *
>> + * @mm: GPU buddy manager
>> + * @addr: Physical address
>> + *
>> + * Returns:
>> + * gpu_buddy_block on success, NULL or error code on failure
>> + */
>> +struct gpu_buddy_block *gpu_buddy_addr_to_block(struct gpu_buddy *mm, u64 addr)
>> +{
>> + struct gpu_buddy_block *block;
>> + LIST_HEAD(dfs);
>> + u64 end;
>> + int i;
>> +
>
> This is somewhat of an orthogonal issue, but we really should update
> gpu_buddy so drivers can register a lock that we can assert is held in
> functions where gpu_buddy manipulates blocks.
>
> Something like this [1].
I think that would be a good addition also.
>
> I’ll defer to the gpu_buddy maintainers on whether this code is correct.
>
> Matt
>
> [1] https://elixir.bootlin.com/linux/v6.19.10/source/include/drm/drm_gpusvm.h#L341
>
>> + end = addr + SZ_4K - 1;
>> + for (i = 0; i < mm->n_roots; ++i)
>> + list_add_tail(&mm->roots[i]->tmp_link, &dfs);
>> +
>> + do {
>> + u64 block_start;
>> + u64 block_end;
>> +
>> + block = list_first_entry_or_null(&dfs,
>> + struct gpu_buddy_block,
>> + tmp_link);
>> + if (!block)
>> + break;
>> +
>> + list_del(&block->tmp_link);
>> +
>> + block_start = gpu_buddy_block_offset(block);
>> + block_end = block_start + gpu_buddy_block_size(mm, block) - 1;
>> +
>> + if (!overlaps(addr, end, block_start, block_end))
>> + continue;
>> +
>> + if (contains(addr, end, block_start, block_end) &&
>> + !gpu_buddy_block_is_split(block)) {
>> + if (gpu_buddy_block_is_free(block))
>> + return NULL;
>> + else if (gpu_buddy_block_is_allocated(block) && !mm->clear_avail)
>> + return block;
>> + }
>> +
>> + if (gpu_buddy_block_is_split(block)) {
>> + list_add(&block->right->tmp_link, &dfs);
>> + list_add(&block->left->tmp_link, &dfs);
>> + }
>> + } while (1);
>> +
>> + return ERR_PTR(-ENXIO);
>> +}
>> +EXPORT_SYMBOL(gpu_buddy_addr_to_block);
>> +
>> static void __gpu_buddy_free_list(struct gpu_buddy *mm,
>> struct list_head *objects,
>> bool mark_clear,
>> diff --git a/include/linux/gpu_buddy.h b/include/linux/gpu_buddy.h
>> index 5fa917ba5450..957c69c560bc 100644
>> --- a/include/linux/gpu_buddy.h
>> +++ b/include/linux/gpu_buddy.h
>> @@ -231,6 +231,8 @@ void gpu_buddy_reset_clear(struct gpu_buddy *mm, bool is_clear);
>>
>> void gpu_buddy_free_block(struct gpu_buddy *mm, struct gpu_buddy_block *block);
>>
>> +struct gpu_buddy_block *gpu_buddy_addr_to_block(struct gpu_buddy *mm, u64 addr);
>> +
>> void gpu_buddy_free_list(struct gpu_buddy *mm,
>> struct list_head *objects,
>> unsigned int flags);
>> --
>> 2.52.0
>>
^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [RFC PATCH V6 3/7] drm/xe: Handle physical memory address error
2026-04-02 1:03 ` Matthew Brost
@ 2026-04-02 10:30 ` Upadhyay, Tejas
2026-04-02 20:20 ` Matthew Brost
0 siblings, 1 reply; 24+ messages in thread
From: Upadhyay, Tejas @ 2026-04-02 10:30 UTC (permalink / raw)
To: Brost, Matthew, Aravind Iddamsetty
Cc: intel-xe@lists.freedesktop.org, Auld, Matthew,
thomas.hellstrom@linux.intel.com, Ghimiray, Himal Prasad
> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: 02 April 2026 06:34
> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Auld, Matthew
> <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com; Ghimiray,
> Himal Prasad <himal.prasad.ghimiray@intel.com>
> Subject: Re: [RFC PATCH V6 3/7] drm/xe: Handle physical memory address
> error
>
> On Fri, Mar 27, 2026 at 05:18:16PM +0530, Tejas Upadhyay wrote:
> > This functionality represents a significant step in making the xe
> > driver gracefully handle hardware memory degradation.
> > By integrating with the DRM Buddy allocator, the driver can
> > permanently "carve out" faulty memory so it isn't reused by subsequent
> > allocations.
> >
> > Buddy Block Reservation:
> > ----------------------
> > When a memory address is reported as faulty, the driver instructs the
> > DRM Buddy allocator to reserve a block of the specific page size
> > (typically 4KB). This marks the memory as "dirty/used"
> > indefinitely.
> >
> > Two-Stage Tracking:
> > -----------------
> > Offlined Pages:
> > Pages that have been successfully isolated and removed from the
> > available memory pool.
> >
> > Queued Pages:
> > Addresses that have been flagged as faulty but are currently in use by
> > a process. These are tracked until the associated buffer object (BO)
> > is released or migrated, at which point they move to the "offlined"
> > state.
> >
> > Sysfs Reporting:
> > --------------
> > The patch exposes these metrics through a standard interface, allowing
> > administrators to monitor VRAM health:
> > /sys/bus/pci/devices/<device_id>/vram_bad_bad_pages
> >
> > V5:
> > - Categorise and handle BOs accordingly
> > - Fix crash found with new debugfs tests
> > V4:
> > - Set block->private NULL post bo purge
> > - Filter out gsm address early on
> > - Rebase
> > V3:
> > -rename api, remove tile dependency and add status of reservation
> > V2:
> > - Fix mm->avail counter issue
> > - Remove unused code and handle clean up in case of error
> >
> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 336
> +++++++++++++++++++++
> > drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 1 +
> > drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 26 ++
> > 3 files changed, 363 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > index c627dbf94552..0fec7b332501 100644
> > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > @@ -13,7 +13,10 @@
> >
> > #include "xe_bo.h"
> > #include "xe_device.h"
> > +#include "xe_exec_queue.h"
> > +#include "xe_lrc.h"
> > #include "xe_res_cursor.h"
> > +#include "xe_ttm_stolen_mgr.h"
> > #include "xe_ttm_vram_mgr.h"
> > #include "xe_vram_types.h"
> >
> > @@ -277,6 +280,26 @@ static const struct ttm_resource_manager_func
> xe_ttm_vram_mgr_func = {
> > .debug = xe_ttm_vram_mgr_debug
> > };
> >
> > +static void xe_ttm_vram_free_bad_pages(struct drm_device *dev, struct
> > +xe_ttm_vram_mgr *mgr) {
> > + struct xe_ttm_vram_offline_resource *pos, *n;
> > +
> > + mutex_lock(&mgr->lock);
> > + list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link)
> {
> > + --mgr->n_offlined_pages;
> > + gpu_buddy_free_list(&mgr->mm, &pos->blocks, 0);
> > + mgr->visible_avail += pos->used_visible_size;
> > + list_del(&pos->offlined_link);
> > + kfree(pos);
> > + }
> > + list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link)
> {
> > + list_del(&pos->queued_link);
> > + mgr->n_queued_pages--;
> > + kfree(pos);
> > + }
> > + mutex_unlock(&mgr->lock);
> > +}
> > +
> > static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
> > {
> > 	struct xe_device *xe = to_xe_device(dev);
> > @@ -288,6 +311,8 @@ static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
> > if (ttm_resource_manager_evict_all(&xe->ttm, man))
> > return;
> >
> > + xe_ttm_vram_free_bad_pages(dev, mgr);
> > +
> > WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
> >
> > gpu_buddy_fini(&mgr->mm);
> > @@ -316,6 +341,8 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
> > man->func = &xe_ttm_vram_mgr_func;
> > mgr->mem_type = mem_type;
> > mutex_init(&mgr->lock);
> > + INIT_LIST_HEAD(&mgr->offlined_pages);
> > + INIT_LIST_HEAD(&mgr->queued_pages);
> > mgr->default_page_size = default_page_size;
> > mgr->visible_size = io_size;
> > mgr->visible_avail = io_size;
> > @@ -471,3 +498,312 @@ u64 xe_ttm_vram_get_avail(struct ttm_resource_manager *man)
> >
> > return avail;
> > }
> > +
> > +static bool is_ttm_vram_migrate_lrc(struct xe_device *xe, struct xe_bo *pbo)
>
> As discussed in prior reply [1] - I think this can be dropped.
>
> [1] https://patchwork.freedesktop.org/patch/714756/?series=161473&rev=6#comment_1318048
>
> > +{
> > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> > + (pbo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
> > + !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
> > + unsigned long idx;
> > + struct xe_exec_queue *q;
> > + struct drm_device *dev = &xe->drm;
> > + struct drm_file *file;
> > + struct xe_lrc *lrc;
> > +
> > + /* TODO : Need to extend to multitile in future if needed */
> > + mutex_lock(&dev->filelist_mutex);
> > + list_for_each_entry(file, &dev->filelist, lhead) {
> > + struct xe_file *xef = file->driver_priv;
> > +
> > + mutex_lock(&xef->exec_queue.lock);
> > + xa_for_each(&xef->exec_queue.xa, idx, q) {
> > + xe_exec_queue_get(q);
> > + mutex_unlock(&xef->exec_queue.lock);
> > +
> > + for (int i = 0; i < q->width; i++) {
> > + lrc = xe_exec_queue_get_lrc(q, i);
> > + if (lrc->bo == pbo) {
> > + xe_lrc_put(lrc);
> > + mutex_lock(&xef->exec_queue.lock);
> > + xe_exec_queue_put(q);
> > + mutex_unlock(&xef->exec_queue.lock);
> > + mutex_unlock(&dev->filelist_mutex);
> > + return false;
> > + }
> > + xe_lrc_put(lrc);
> > + }
> > + mutex_lock(&xef->exec_queue.lock);
> > + xe_exec_queue_put(q);
> > + mutex_unlock(&xef->exec_queue.lock);
> > + }
> > + }
> > + mutex_unlock(&dev->filelist_mutex);
> > + return true;
> > + }
> > + return false;
> > +}
> > +
> > +static void xe_ttm_vram_purge_page(struct xe_device *xe, struct xe_bo *pbo)
> > +{
> > + struct ttm_placement place = {};
> > + struct ttm_operation_ctx ctx = {
> > + .interruptible = false,
> > + .gfp_retry_mayfail = false,
> > + };
> > + bool locked;
> > + int ret = 0;
> > +
> > + /* Ban VM if BO is PPGTT */
> > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> > + pbo->flags & XE_BO_FLAG_PAGETABLE) {
>
> I think XE_BO_FLAG_PAGETABLE and XE_BO_FLAG_FORCE_USER_VRAM are
> sufficient here.
>
> Also, if XE_BO_FLAG_PAGETABLE is set but XE_BO_FLAG_FORCE_USER_VRAM
> is clear, that means this is a kernel VM and we probably have to wedge the
> device, right?
I am looking at all the other review comments; meanwhile, as a quick response: @Aravind Iddamsetty was suggesting an SBR reset for critical BOs in place of a wedge.
Tejas
>
> > + down_write(&pbo->vm->lock);
> > + xe_vm_kill(pbo->vm, true);
> > + up_write(&pbo->vm->lock);
> > + }
> > +
> > + /* Ban exec queue if BO is lrc */
> > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> > + (pbo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
> > + !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
>
>
> This is a huge if statement just to determine whether this is an LRC. At a
> minimum, we’d need to normalize this, and it looks very fragile—if we change
> flags elsewhere in the driver, this if statement could easily break.
>
> Also, I can’t say I’m a fan of searching just to kill an individual queue.
>
> It’s a bit unfortunate that LRCs are created without a VM (I forget the exact
> reasoning, but I seem to recall it was related to multi-q?)
>
> I think what we really want to do is:
>
> - If we find a PT or LRC BO, kill the VM.
> - Update ‘kill VM’ to kill all exec queues. I honestly forget why we
> only kill preempt/rebind queues—it’s likely some nonsensical reasoning
> that we never cleaned up. We already have xe_vm_add_exec_queue(), which
> is short-circuited on xe->info.has_ctx_tlb_inval, but we can just
> remove that.
> - Normalize this with an LRC BO flag and store the user_vm in the BO for
> LRCs.
> - Critical kernel BOs normalized with BO flag -> wedge the device
>
> The difference between killing a queue and killing a VM doesn’t really matter
> from a user-space point of view, since typically a single-queue hang leads to
> the entire process crashing or restarting—at least for Mesa 3D. We should
> confirm with compute whether this is also what we’re targeting for CRI, but I
> suspect the answer is the same. Even if it isn’t, I’m not convinced per-queue
> killing is worthwhile. And if we decide it is, the filelist / exec_queue.xa search is
> pretty much a non-starter for me—for example, we’d need to make this much
> simpler and avoid taking a bunch of locks here, which looks pretty scary.
>
> > + struct drm_device *dev = &xe->drm;
> > + struct xe_exec_queue *q;
> > + struct drm_file *file;
> > + struct xe_lrc *lrc;
> > + unsigned long idx;
> > +
> > + /* TODO : Need to extend to multitile in future if needed */
> > + mutex_lock(&dev->filelist_mutex);
> > + list_for_each_entry(file, &dev->filelist, lhead) {
> > + struct xe_file *xef = file->driver_priv;
> > +
> > + mutex_lock(&xef->exec_queue.lock);
> > + xa_for_each(&xef->exec_queue.xa, idx, q) {
> > + xe_exec_queue_get(q);
> > + mutex_unlock(&xef->exec_queue.lock);
> > +
> > + for (int i = 0; i < q->width; i++) {
> > + lrc = xe_exec_queue_get_lrc(q, i);
> > + if (lrc->bo == pbo) {
> > + xe_lrc_put(lrc);
> > + xe_exec_queue_kill(q);
> > + } else {
> > + xe_lrc_put(lrc);
> > + }
> > + }
> > +
> > + mutex_lock(&xef->exec_queue.lock);
> > + xe_exec_queue_put(q);
> > + mutex_unlock(&xef->exec_queue.lock);
> > + }
> > + }
> > + mutex_unlock(&dev->filelist_mutex);
> > + }
> > +
> > + spin_lock(&pbo->ttm.bdev->lru_lock);
> > + locked = dma_resv_trylock(pbo->ttm.base.resv);
> > + spin_unlock(&pbo->ttm.bdev->lru_lock);
> > + WARN_ON(!locked);
>
> Is there any reason why we can’t just take a sleeping dma_resv_lock here (e.g.
> xe_bo_lock)? Also, I think the trick with the LRU lock only works once the BO’s
> dma_resv has been individualized (kref == 0), which is clearly not the case
> here.
>
> > + ret = ttm_bo_validate(&pbo->ttm, &place, &ctx);
> > + drm_WARN_ON(&xe->drm, ret);
> > + xe_bo_put(pbo);
> > + if (locked)
> > + dma_resv_unlock(pbo->ttm.base.resv);
> > +}
> > +
> > +static int xe_ttm_vram_reserve_page_at_addr(struct xe_device *xe, unsigned long addr,
> > + struct xe_ttm_vram_mgr *vram_mgr, struct gpu_buddy *mm)
> > +{
> > + struct xe_ttm_vram_offline_resource *nentry;
> > + struct ttm_buffer_object *tbo = NULL;
> > + struct gpu_buddy_block *block;
> > + struct gpu_buddy_block *b, *m;
> > + enum reserve_status {
> > + pending = 0,
> > + fail
> > + };
> > + u64 size = SZ_4K;
> > + int ret = 0;
> > +
> > + mutex_lock(&vram_mgr->lock);
>
> You’re going to have to fix the locking here. For example, the lock is released
> inside nested if statements below, which makes this function very difficult to
> follow. Personally, I can’t really focus on anything else until this is cleaned up.
> I’m not saying we don’t already have bad locking patterns in Xe—I’m sure we
> do—but let’s avoid introducing new code with those patterns.
>
> For example, it should look more like this:
>
> mutex_lock(&vram_mgr->lock);
> /* Do the minimal work that requires the lock */
> mutex_unlock(&vram_mgr->lock);
>
> /* Do other work where &vram_mgr->lock needs to be dropped */
>
> mutex_lock(&vram_mgr->lock);
> /* Do more work that requires the lock */
> mutex_unlock(&vram_mgr->lock);
>
> Also strongly prefer guards or scoped_guards too.
>
> > + block = gpu_buddy_addr_to_block(mm, addr);
> > + if (PTR_ERR(block) == -ENXIO) {
> > + mutex_unlock(&vram_mgr->lock);
> > + return -ENXIO;
> > + }
> > +
> > + nentry = kzalloc_obj(*nentry);
> > + if (!nentry)
> > + return -ENOMEM;
> > + INIT_LIST_HEAD(&nentry->blocks);
> > + nentry->status = pending;
> > +
> > + if (block) {
> > + struct xe_ttm_vram_offline_resource *pos, *n;
> > + struct xe_bo *pbo;
> > +
> > + WARN_ON(!block->private);
> > + tbo = block->private;
> > + pbo = ttm_to_xe_bo(tbo);
> > +
> > + xe_bo_get(pbo);
>
> This probably needs a kref get if it’s non‑zero. If this is a zombie BO, it should
> already be getting destroyed. Also, we’re going to need to look into gutting the
> TTM pipeline as well, where TTM resources are transferred to different BOs—
> but there’s enough to clean up here first before we get to that.
>
> I'm going to stop here as there is quite a bit to cleanup / simplify before I can
> dig in more.
>
> Matt
>
> > + /* Critical kernel BO? */
> > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > + (!(pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM) ||
> > + is_ttm_vram_migrate_lrc(xe, pbo))) {
> > + mutex_unlock(&vram_mgr->lock);
> > + kfree(nentry);
> > + xe_ttm_vram_free_bad_pages(&xe->drm, vram_mgr);
> > + xe_bo_put(pbo);
> > + drm_err(&xe->drm,
> > + "%s: corrupt addr: 0x%lx in critical kernel bo, request reset\n",
> > + __func__, addr);
> > + /* Hint System controller driver for reset with -EIO */
> > + return -EIO;
> > + }
> > + nentry->id = ++vram_mgr->n_queued_pages;
> > + list_add(&nentry->queued_link, &vram_mgr->queued_pages);
> > + mutex_unlock(&vram_mgr->lock);
> > +
> > + /* Purge BO containing address */
> > + xe_ttm_vram_purge_page(xe, pbo);
> > +
> > + /* Reserve page at address addr*/
> > + mutex_lock(&vram_mgr->lock);
> > + ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
> > + size, size, &nentry->blocks,
> > + GPU_BUDDY_RANGE_ALLOCATION);
> > +
> > + if (ret) {
> > + drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
> > + addr, ret);
> > + nentry->status = fail;
> > + mutex_unlock(&vram_mgr->lock);
> > + return ret;
> > + }
> > +
> > + list_for_each_entry_safe(b, m, &nentry->blocks, link)
> > + b->private = NULL;
> > +
> > + if ((addr + size) <= vram_mgr->visible_size) {
> > + nentry->used_visible_size = size;
> > + } else {
> > + list_for_each_entry(b, &nentry->blocks, link) {
> > + u64 start = gpu_buddy_block_offset(b);
> > +
> > + if (start < vram_mgr->visible_size) {
> > + u64 end = start + gpu_buddy_block_size(mm, b);
> > +
> > + nentry->used_visible_size +=
> > + min(end, vram_mgr->visible_size) - start;
> > + }
> > + }
> > + }
> > + vram_mgr->visible_avail -= nentry->used_visible_size;
> > + list_for_each_entry_safe(pos, n, &vram_mgr->queued_pages, queued_link) {
> > + if (pos->id == nentry->id) {
> > + --vram_mgr->n_queued_pages;
> > + list_del(&pos->queued_link);
> > + break;
> > + }
> > + }
> > + list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
> > + /* TODO: FW Integration: Send command to FW for offlining page */
> > + ++vram_mgr->n_offlined_pages;
> > + mutex_unlock(&vram_mgr->lock);
> > + return ret;
> > +
> > + } else {
> > + ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
> > + size, size, &nentry->blocks,
> > +
> GPU_BUDDY_RANGE_ALLOCATION);
> > + if (ret) {
> > + drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
> > + addr, ret);
> > + nentry->status = fail;
> > + mutex_unlock(&vram_mgr->lock);
> > + return ret;
> > + }
> > +
> > + list_for_each_entry_safe(b, m, &nentry->blocks, link)
> > + b->private = NULL;
> > +
> > + if ((addr + size) <= vram_mgr->visible_size) {
> > + nentry->used_visible_size = size;
> > + } else {
> > + struct gpu_buddy_block *block;
> > +
> > + list_for_each_entry(block, &nentry->blocks, link) {
> > + u64 start = gpu_buddy_block_offset(block);
> > +
> > + if (start < vram_mgr->visible_size) {
> > + u64 end = start + gpu_buddy_block_size(mm, block);
> > +
> > + nentry->used_visible_size +=
> > + min(end, vram_mgr->visible_size) - start;
> > + }
> > + }
> > + }
> > + vram_mgr->visible_avail -= nentry->used_visible_size;
> > + nentry->id = ++vram_mgr->n_offlined_pages;
> > + list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
> > + /* TODO: FW Integration: Send command to FW for offlining page */
> > + mutex_unlock(&vram_mgr->lock);
> > + }
> > + /* Success */
> > + return ret;
> > +}
> > +
> > +static struct xe_vram_region *xe_ttm_vram_addr_to_region(struct xe_device *xe,
> > + resource_size_t addr)
> > +{
> > + unsigned long stolen_base = xe_ttm_stolen_gpu_offset(xe);
> > + struct xe_vram_region *vr;
> > + struct xe_tile *tile;
> > + int id;
> > +
> > + /* Addr from stolen memory? */
> > + if (addr + SZ_4K >= stolen_base)
> > + return NULL;
> > +
> > + for_each_tile(tile, xe, id) {
> > + vr = tile->mem.vram;
> > + if ((addr <= vr->dpa_base + vr->actual_physical_size) &&
> > + (addr + SZ_4K >= vr->dpa_base))
> > + return vr;
> > + }
> > + return NULL;
> > +}
> > +
> > +/**
> > + * xe_ttm_vram_handle_addr_fault - Handle flagged VRAM physical address error
> > + * @xe: pointer to parent device
> > + * @addr: physical faulty address
> > + *
> > + * Handle the physical faulty address error on a specific tile.
> > + *
> > + * Returns 0 for success, negative error code otherwise.
> > + */
> > +int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
> > +{
> > + struct xe_ttm_vram_mgr *vram_mgr;
> > + struct xe_vram_region *vr;
> > + struct gpu_buddy *mm;
> > + int ret;
> > +
> > + vr = xe_ttm_vram_addr_to_region(xe, addr);
> > + if (!vr) {
> > + drm_err(&xe->drm, "%s:%d addr:%lx error requesting SBR\n",
> > + __func__, __LINE__, addr);
> > + /* Hint System controller driver for reset with -EIO */
> > + return -EIO;
> > + }
> > + vram_mgr = &vr->ttm;
> > + mm = &vram_mgr->mm;
> > + /* Reserve page at address */
> > + ret = xe_ttm_vram_reserve_page_at_addr(xe, addr, vram_mgr, mm);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL(xe_ttm_vram_handle_addr_fault);
> > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > index 87b7fae5edba..8ef06d9d44f7 100644
> > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > @@ -31,6 +31,7 @@ u64 xe_ttm_vram_get_cpu_visible_size(struct ttm_resource_manager *man);
> > void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
> > u64 *used, u64 *used_visible);
> >
> > +int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr);
> > static inline struct xe_ttm_vram_mgr_resource *
> > to_xe_ttm_vram_mgr_resource(struct ttm_resource *res)
> > {
> > diff --git
> > a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> > index 9106da056b49..94eaf9d875f1 100644
> > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> > @@ -19,6 +19,14 @@ struct xe_ttm_vram_mgr {
> > struct ttm_resource_manager manager;
> > /** @mm: DRM buddy allocator which manages the VRAM */
> > struct gpu_buddy mm;
> > + /** @offlined_pages: List of offlined pages */
> > + struct list_head offlined_pages;
> > + /** @n_offlined_pages: Number of offlined pages */
> > + u16 n_offlined_pages;
> > + /** @queued_pages: List of queued pages */
> > + struct list_head queued_pages;
> > + /** @n_queued_pages: Number of queued pages */
> > + u16 n_queued_pages;
> > /** @visible_size: Proped size of the CPU visible portion */
> > u64 visible_size;
> > /** @visible_avail: CPU visible portion still unallocated */
> > @@ -45,4 +53,22 @@ struct xe_ttm_vram_mgr_resource {
> > unsigned long flags;
> > };
> >
> > +/**
> > + * struct xe_ttm_vram_offline_resource - Xe TTM VRAM offline resource
> > + */
> > +struct xe_ttm_vram_offline_resource {
> > + /** @offlined_link: Link to offlined pages */
> > + struct list_head offlined_link;
> > + /** @queued_link: Link to queued pages */
> > + struct list_head queued_link;
> > + /** @blocks: list of DRM buddy blocks */
> > + struct list_head blocks;
> > + /** @used_visible_size: How many CPU visible bytes this resource is using */
> > + u64 used_visible_size;
> > + /** @id: The id of an offline resource */
> > + u16 id;
> > + /** @status: reservation status of resource */
> > + bool status;
> > +};
> > +
> > #endif
> > --
> > 2.52.0
> >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH V6 3/7] drm/xe: Handle physical memory address error
2026-04-02 10:30 ` Upadhyay, Tejas
@ 2026-04-02 20:20 ` Matthew Brost
2026-04-07 12:03 ` Upadhyay, Tejas
0 siblings, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-04-02 20:20 UTC (permalink / raw)
To: Upadhyay, Tejas
Cc: Aravind Iddamsetty, intel-xe@lists.freedesktop.org, Auld, Matthew,
thomas.hellstrom@linux.intel.com, Ghimiray, Himal Prasad
On Thu, Apr 02, 2026 at 04:30:47AM -0600, Upadhyay, Tejas wrote:
>
>
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: 02 April 2026 06:34
> > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Auld, Matthew
> > <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com; Ghimiray,
> > Himal Prasad <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [RFC PATCH V6 3/7] drm/xe: Handle physical memory address
> > error
> >
> > On Fri, Mar 27, 2026 at 05:18:16PM +0530, Tejas Upadhyay wrote:
> > > This functionality represents a significant step in making the xe
> > > driver gracefully handle hardware memory degradation.
> > > By integrating with the DRM Buddy allocator, the driver can
> > > permanently "carve out" faulty memory so it isn't reused by subsequent
> > > allocations.
> > >
> > > Buddy Block Reservation:
> > > ----------------------
> > > When a memory address is reported as faulty, the driver instructs the
> > > DRM Buddy allocator to reserve a block of the specific page size
> > > (typically 4KB). This marks the memory as "dirty/used"
> > > indefinitely.
> > >
> > > Two-Stage Tracking:
> > > -----------------
> > > Offlined Pages:
> > > Pages that have been successfully isolated and removed from the
> > > available memory pool.
> > >
> > > Queued Pages:
> > > Addresses that have been flagged as faulty but are currently in use by
> > > a process. These are tracked until the associated buffer object (BO)
> > > is released or migrated, at which point they move to the "offlined"
> > > state.
> > >
> > > Sysfs Reporting:
> > > --------------
> > > The patch exposes these metrics through a standard interface, allowing
> > > administrators to monitor VRAM health:
> > > /sys/bus/pci/devices/<device_id>/vram_bad_bad_pages
> > >
> > > V5:
> > > - Categorise and handle BOs accordingly
> > > - Fix crash found with new debugfs tests
> > > V4:
> > > - Set block->private NULL post bo purge
> > > - Filter out gsm address early on
> > > - Rebase
> > > V3:
> > > -rename api, remove tile dependency and add status of reservation
> > > V2:
> > > - Fix mm->avail counter issue
> > > - Remove unused code and handle clean up in case of error
> > >
> > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 336
> > +++++++++++++++++++++
> > > drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 1 +
> > > drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 26 ++
> > > 3 files changed, 363 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > index c627dbf94552..0fec7b332501 100644
> > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > @@ -13,7 +13,10 @@
> > >
> > > #include "xe_bo.h"
> > > #include "xe_device.h"
> > > +#include "xe_exec_queue.h"
> > > +#include "xe_lrc.h"
> > > #include "xe_res_cursor.h"
> > > +#include "xe_ttm_stolen_mgr.h"
> > > #include "xe_ttm_vram_mgr.h"
> > > #include "xe_vram_types.h"
> > >
> > > @@ -277,6 +280,26 @@ static const struct ttm_resource_manager_func
> > xe_ttm_vram_mgr_func = {
> > > .debug = xe_ttm_vram_mgr_debug
> > > };
> > >
> > > +static void xe_ttm_vram_free_bad_pages(struct drm_device *dev, struct
> > > +xe_ttm_vram_mgr *mgr) {
> > > + struct xe_ttm_vram_offline_resource *pos, *n;
> > > +
> > > + mutex_lock(&mgr->lock);
> > > + list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link)
> > {
> > > + --mgr->n_offlined_pages;
> > > + gpu_buddy_free_list(&mgr->mm, &pos->blocks, 0);
> > > + mgr->visible_avail += pos->used_visible_size;
> > > + list_del(&pos->offlined_link);
> > > + kfree(pos);
> > > + }
> > > + list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link)
> > {
> > > + list_del(&pos->queued_link);
> > > + mgr->n_queued_pages--;
> > > + kfree(pos);
> > > + }
> > > + mutex_unlock(&mgr->lock);
> > > +}
> > > +
> > > static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
> > > {
> > > 	struct xe_device *xe = to_xe_device(dev);
> > > @@ -288,6 +311,8 @@ static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
> > > if (ttm_resource_manager_evict_all(&xe->ttm, man))
> > > return;
> > >
> > > + xe_ttm_vram_free_bad_pages(dev, mgr);
> > > +
> > > WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
> > >
> > > gpu_buddy_fini(&mgr->mm);
> > > @@ -316,6 +341,8 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
> > > man->func = &xe_ttm_vram_mgr_func;
> > > mgr->mem_type = mem_type;
> > > mutex_init(&mgr->lock);
> > > + INIT_LIST_HEAD(&mgr->offlined_pages);
> > > + INIT_LIST_HEAD(&mgr->queued_pages);
> > > mgr->default_page_size = default_page_size;
> > > mgr->visible_size = io_size;
> > > mgr->visible_avail = io_size;
> > > @@ -471,3 +498,312 @@ u64 xe_ttm_vram_get_avail(struct ttm_resource_manager *man)
> > >
> > > return avail;
> > > }
> > > +
> > > +static bool is_ttm_vram_migrate_lrc(struct xe_device *xe, struct xe_bo *pbo)
> >
> > As discussed in prior reply [1] - I think this can be dropped.
> >
> > [1] https://patchwork.freedesktop.org/patch/714756/?series=161473&rev=6#comment_1318048
> >
> > > +{
> > > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > > + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> > > + (pbo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
> > > + !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
> > > + unsigned long idx;
> > > + struct xe_exec_queue *q;
> > > + struct drm_device *dev = &xe->drm;
> > > + struct drm_file *file;
> > > + struct xe_lrc *lrc;
> > > +
> > > + /* TODO : Need to extend to multitile in future if needed */
> > > + mutex_lock(&dev->filelist_mutex);
> > > + list_for_each_entry(file, &dev->filelist, lhead) {
> > > + struct xe_file *xef = file->driver_priv;
> > > +
> > > + mutex_lock(&xef->exec_queue.lock);
> > > + xa_for_each(&xef->exec_queue.xa, idx, q) {
> > > + xe_exec_queue_get(q);
> > > + mutex_unlock(&xef->exec_queue.lock);
> > > +
> > > + for (int i = 0; i < q->width; i++) {
> > > + lrc = xe_exec_queue_get_lrc(q, i);
> > > + if (lrc->bo == pbo) {
> > > + xe_lrc_put(lrc);
> > > + mutex_lock(&xef->exec_queue.lock);
> > > + xe_exec_queue_put(q);
> > > + mutex_unlock(&xef->exec_queue.lock);
> > > + mutex_unlock(&dev->filelist_mutex);
> > > + return false;
> > > + }
> > > + xe_lrc_put(lrc);
> > > + }
> > > + mutex_lock(&xef->exec_queue.lock);
> > > + xe_exec_queue_put(q);
> > > + mutex_unlock(&xef->exec_queue.lock);
> > > + }
> > > + }
> > > + mutex_unlock(&dev->filelist_mutex);
> > > + return true;
> > > + }
> > > + return false;
> > > +}
> > > +
> > > +static void xe_ttm_vram_purge_page(struct xe_device *xe, struct xe_bo *pbo)
> > > +{
> > > + struct ttm_placement place = {};
> > > + struct ttm_operation_ctx ctx = {
> > > + .interruptible = false,
> > > + .gfp_retry_mayfail = false,
> > > + };
> > > + bool locked;
> > > + int ret = 0;
> > > +
> > > + /* Ban VM if BO is PPGTT */
> > > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > > + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> > > + pbo->flags & XE_BO_FLAG_PAGETABLE) {
> >
> > I think XE_BO_FLAG_PAGETABLE and XE_BO_FLAG_FORCE_USER_VRAM are
> > sufficient here.
> >
> > Also, if XE_BO_FLAG_PAGETABLE is set but XE_BO_FLAG_FORCE_USER_VRAM
> > is clear, that means this is a kernel VM and we probably have to wedge the
> > device, right?
>
> I am looking at all the other review comments; meanwhile, as a quick response: @Aravind Iddamsetty was suggesting an SBR reset for critical BOs in place of a wedge.
>
That very well could be correct — it would have to be some kind of
global event if a critical BO encounters an error and the driver needs
to recover. Part of the recovery process has to be replacing the
critical BO with a new one too, right?
A few more comments below.
> Tejas
> >
> > > + down_write(&pbo->vm->lock);
> > > + xe_vm_kill(pbo->vm, true);
> > > + up_write(&pbo->vm->lock);
> > > + }
> > > +
> > > + /* Ban exec queue if BO is lrc */
> > > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > > + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> > > + (pbo->flags & (XE_BO_FLAG_GGTT | XE_BO_FLAG_GGTT_INVALIDATE)) &&
> > > + !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
> >
> >
> > This is a huge if statement just to determine whether this is an LRC. At a
> > minimum, we’d need to normalize this, and it looks very fragile—if we change
> > flags elsewhere in the driver, this if statement could easily break.
> >
> > Also, I can’t say I’m a fan of searching just to kill an individual queue.
> >
> > It’s a bit unfortunate that LRCs are created without a VM (I forget the exact
> > reasoning, but I seem to recall it was related to multi-q?)
> >
> > I think what we really want to do is:
> >
> > - If we find a PT or LRC BO, kill the VM.
If the PT encounters an error, we likely also want to immediately
invalidate the VM’s page table structure to avoid a device page-table
walk reading a corrupted PT. xe_vm_close() does this in the code below:
1828         if (bound) {
1829                 for_each_tile(tile, xe, id)
1830                         if (vm->pt_root[id])
1831                                 xe_pt_clear(xe, vm->pt_root[id]);
1832
1833                 for_each_gt(gt, xe, id)
1834                         xe_tlb_inval_vm(&gt->tlb_inval, vm);
1835         }
Maybe this could be extracted into a helper and called here, likely
after xe_vm_kill(). There is also a weird corner case where the PT that
encounters the error is vm->pt_root[id], which is particularly bad,
because in that case we can’t call xe_pt_clear(). That operation
involves a CPU write, and if I remember correctly, things go really
badly then.
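A minimal sketch of the helper extraction suggested above, reusing the names from the quoted xe_vm_close() snippet. The `vm->xe` back-pointer and exact field names are assumptions, and this is not compile-tested against the tree:

```c
/* Hypothetical helper factored out of xe_vm_close(): clear the VM's
 * page-table roots and invalidate TLBs so the device cannot walk a
 * corrupted PT. Names follow the quoted snippet and are illustrative. */
static void xe_vm_clear_and_inval_pt(struct xe_vm *vm)
{
	struct xe_device *xe = vm->xe;	/* assumed back-pointer */
	struct xe_tile *tile;
	struct xe_gt *gt;
	u8 id;

	for_each_tile(tile, xe, id)
		if (vm->pt_root[id])
			xe_pt_clear(xe, vm->pt_root[id]);

	for_each_gt(gt, xe, id)
		xe_tlb_inval_vm(&gt->tlb_inval, vm);
}
```

As noted, this cannot cover the case where pt_root itself is the faulty page, since xe_pt_clear() performs a CPU write into that memory.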
> > - Update ‘kill VM’ to kill all exec queues. I honestly forget why we
> > only kill preempt/rebind queues—it’s likely some nonsensical reasoning
> > that we never cleaned up. We already have xe_vm_add_exec_queue(), which
> > is short-circuited on xe->info.has_ctx_tlb_inval, but we can just
> > remove that.
> > - Normalize this with an LRC BO flag and store the user_vm in the BO for
> > LRCs.
> > - Critical kernel BOs normalized with BO flag -> wedge the device
> >
> > The difference between killing a queue and killing a VM doesn’t really matter
> > from a user-space point of view, since typically a single-queue hang leads to
> > the entire process crashing or restarting—at least for Mesa 3D. We should
> > confirm with compute whether this is also what we’re targeting for CRI, but I
> > suspect the answer is the same. Even if it isn’t, I’m not convinced per-queue
> > killing is worthwhile. And if we decide it is, the filelist / exec_queue.xa search is
> > pretty much a non-starter for me—for example, we’d need to make this much
> > simpler and avoid taking a bunch of locks here, which looks pretty scary.
> >
> > > + struct drm_device *dev = &xe->drm;
> > > + struct xe_exec_queue *q;
> > > + struct drm_file *file;
> > > + struct xe_lrc *lrc;
> > > + unsigned long idx;
> > > +
> > > + /* TODO : Need to extend to multitile in future if needed */
> > > + mutex_lock(&dev->filelist_mutex);
> > > + list_for_each_entry(file, &dev->filelist, lhead) {
> > > + struct xe_file *xef = file->driver_priv;
> > > +
> > > + mutex_lock(&xef->exec_queue.lock);
> > > + xa_for_each(&xef->exec_queue.xa, idx, q) {
> > > + xe_exec_queue_get(q);
> > > + mutex_unlock(&xef->exec_queue.lock);
> > > +
> > > + for (int i = 0; i < q->width; i++) {
> > > + lrc = xe_exec_queue_get_lrc(q, i);
> > > + if (lrc->bo == pbo) {
> > > + xe_lrc_put(lrc);
> > > + xe_exec_queue_kill(q);
> > > + } else {
> > > + xe_lrc_put(lrc);
> > > + }
> > > + }
> > > +
> > > + mutex_lock(&xef->exec_queue.lock);
> > > + xe_exec_queue_put(q);
> > > + mutex_unlock(&xef->exec_queue.lock);
> > > + }
> > > + }
> > > + mutex_unlock(&dev->filelist_mutex);
> > > + }
> > > +
> > > + spin_lock(&pbo->ttm.bdev->lru_lock);
> > > + locked = dma_resv_trylock(pbo->ttm.base.resv);
> > > + spin_unlock(&pbo->ttm.bdev->lru_lock);
> > > + WARN_ON(!locked);
> >
> > Is there any reason why we can’t just take a sleeping dma_resv_lock here (e.g.
> > xe_bo_lock)? Also, I think the trick with the LRU lock only works once the BO’s
> > dma_resv has been individualized (kref == 0), which is clearly not the case
> > here.
> >
> > > + ret = ttm_bo_validate(&pbo->ttm, &place, &ctx);
I thought I had typed this out, but with the purging-BO series now
merged, we have a helper for purging: xe_ttm_bo_purge(). I don’t think
it works exactly for this case, but with small updates, I believe it
could be made to work. It also does things like remove the page tables
for the BO, which I think is desired.
Matt
> > > + drm_WARN_ON(&xe->drm, ret);
> > > + xe_bo_put(pbo);
> > > + if (locked)
> > > + dma_resv_unlock(pbo->ttm.base.resv);
> > > +}
> > > +
> > > +static int xe_ttm_vram_reserve_page_at_addr(struct xe_device *xe, unsigned long addr,
> > > +					    struct xe_ttm_vram_mgr *vram_mgr, struct gpu_buddy *mm)
> > > +{
> > > + struct xe_ttm_vram_offline_resource *nentry;
> > > + struct ttm_buffer_object *tbo = NULL;
> > > + struct gpu_buddy_block *block;
> > > + struct gpu_buddy_block *b, *m;
> > > + enum reserve_status {
> > > + pending = 0,
> > > + fail
> > > + };
> > > + u64 size = SZ_4K;
> > > + int ret = 0;
> > > +
> > > + mutex_lock(&vram_mgr->lock);
> >
> > You’re going to have to fix the locking here. For example, the lock is released
> > inside nested if statements below, which makes this function very difficult to
> > follow. Personally, I can’t really focus on anything else until this is cleaned up.
> > I’m not saying we don’t already have bad locking patterns in Xe—I’m sure we
> > do—but let’s avoid introducing new code with those patterns.
> >
> > For example, it should look more like this:
> >
> > mutex_lock(&vram_mgr->lock);
> > /* Do the minimal work that requires the lock */
> > mutex_unlock(&vram_mgr->lock);
> >
> > /* Do other work where &vram_mgr->lock needs to be dropped */
> >
> > mutex_lock(&vram_mgr->lock);
> > /* Do more work that requires the lock */
> > mutex_unlock(&vram_mgr->lock);
> >
> > Also strongly prefer guards or scoped_guards too.
> >
> > > + block = gpu_buddy_addr_to_block(mm, addr);
> > > + if (PTR_ERR(block) == -ENXIO) {
> > > + mutex_unlock(&vram_mgr->lock);
> > > + return -ENXIO;
> > > + }
> > > +
> > > + nentry = kzalloc_obj(*nentry);
> > > + if (!nentry)
> > > + return -ENOMEM;
> > > + INIT_LIST_HEAD(&nentry->blocks);
> > > + nentry->status = pending;
> > > +
> > > + if (block) {
> > > + struct xe_ttm_vram_offline_resource *pos, *n;
> > > + struct xe_bo *pbo;
> > > +
> > > + WARN_ON(!block->private);
> > > + tbo = block->private;
> > > + pbo = ttm_to_xe_bo(tbo);
> > > +
> > > + xe_bo_get(pbo);
> >
> > This probably needs a kref get if it’s non‑zero. If this is a zombie BO, it should
> > already be getting destroyed. Also, we’re going to need to look into gutting the
> > TTM pipeline as well, where TTM resources are transferred to different BOs—
> > but there’s enough to clean up here first before we get to that.
> >
> > I'm going to stop here as there is quite a bit to cleanup / simplify before I can
> > dig in more.
> >
> > Matt
> >
> > > + /* Critical kernel BO? */
> > > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > > + (!(pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM) ||
> > > + is_ttm_vram_migrate_lrc(xe, pbo))) {
> > > + mutex_unlock(&vram_mgr->lock);
> > > + kfree(nentry);
> > > + xe_ttm_vram_free_bad_pages(&xe->drm, vram_mgr);
> > > + xe_bo_put(pbo);
> > > + drm_err(&xe->drm,
> > > + "%s: corrupt addr: 0x%lx in critical kernel bo,
> > request reset\n",
> > > + __func__, addr);
> > > + /* Hint System controller driver for reset with -EIO */
> > > + return -EIO;
> > > + }
> > > + nentry->id = ++vram_mgr->n_queued_pages;
> > > +		list_add(&nentry->queued_link, &vram_mgr->queued_pages);
> > > + mutex_unlock(&vram_mgr->lock);
> > > +
> > > + /* Purge BO containing address */
> > > + xe_ttm_vram_purge_page(xe, pbo);
> > > +
> > > + /* Reserve page at address addr*/
> > > + mutex_lock(&vram_mgr->lock);
> > > + ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
> > > + size, size, &nentry->blocks,
> > > +					     GPU_BUDDY_RANGE_ALLOCATION);
> > > +
> > > + if (ret) {
> > > +			drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
> > > +				 addr, ret);
> > > + nentry->status = fail;
> > > + mutex_unlock(&vram_mgr->lock);
> > > + return ret;
> > > + }
> > > +
> > > + list_for_each_entry_safe(b, m, &nentry->blocks, link)
> > > + b->private = NULL;
> > > +
> > > + if ((addr + size) <= vram_mgr->visible_size) {
> > > + nentry->used_visible_size = size;
> > > + } else {
> > > + list_for_each_entry(b, &nentry->blocks, link) {
> > > + u64 start = gpu_buddy_block_offset(b);
> > > +
> > > + if (start < vram_mgr->visible_size) {
> > > +				u64 end = start + gpu_buddy_block_size(mm, b);
> > > +
> > > +				nentry->used_visible_size +=
> > > +					min(end, vram_mgr->visible_size) - start;
> > > + }
> > > + }
> > > + }
> > > + vram_mgr->visible_avail -= nentry->used_visible_size;
> > > +		list_for_each_entry_safe(pos, n, &vram_mgr->queued_pages, queued_link) {
> > > + if (pos->id == nentry->id) {
> > > + --vram_mgr->n_queued_pages;
> > > + list_del(&pos->queued_link);
> > > + break;
> > > + }
> > > + }
> > > +		list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
> > > +		/* TODO: FW Integration: Send command to FW for offlining page */
> > > + ++vram_mgr->n_offlined_pages;
> > > + mutex_unlock(&vram_mgr->lock);
> > > + return ret;
> > > +
> > > + } else {
> > > + ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
> > > + size, size, &nentry->blocks,
> > > +					     GPU_BUDDY_RANGE_ALLOCATION);
> > > + if (ret) {
> > > +			drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
> > > +				 addr, ret);
> > > + nentry->status = fail;
> > > + mutex_unlock(&vram_mgr->lock);
> > > + return ret;
> > > + }
> > > +
> > > + list_for_each_entry_safe(b, m, &nentry->blocks, link)
> > > + b->private = NULL;
> > > +
> > > + if ((addr + size) <= vram_mgr->visible_size) {
> > > + nentry->used_visible_size = size;
> > > + } else {
> > > + struct gpu_buddy_block *block;
> > > +
> > > + list_for_each_entry(block, &nentry->blocks, link) {
> > > + u64 start = gpu_buddy_block_offset(block);
> > > +
> > > + if (start < vram_mgr->visible_size) {
> > > +				u64 end = start + gpu_buddy_block_size(mm, block);
> > > +
> > > +				nentry->used_visible_size +=
> > > +					min(end, vram_mgr->visible_size) - start;
> > > + }
> > > + }
> > > + }
> > > + vram_mgr->visible_avail -= nentry->used_visible_size;
> > > + nentry->id = ++vram_mgr->n_offlined_pages;
> > > +		list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
> > > +		/* TODO: FW Integration: Send command to FW for offlining page */
> > > + mutex_unlock(&vram_mgr->lock);
> > > + }
> > > + /* Success */
> > > + return ret;
> > > +}
> > > +
> > > +static struct xe_vram_region *xe_ttm_vram_addr_to_region(struct xe_device *xe,
> > > +							  resource_size_t addr)
> > > +{
> > > + unsigned long stolen_base = xe_ttm_stolen_gpu_offset(xe);
> > > + struct xe_vram_region *vr;
> > > + struct xe_tile *tile;
> > > + int id;
> > > +
> > > + /* Addr from stolen memory? */
> > > + if (addr + SZ_4K >= stolen_base)
> > > + return NULL;
> > > +
> > > + for_each_tile(tile, xe, id) {
> > > + vr = tile->mem.vram;
> > > + if ((addr <= vr->dpa_base + vr->actual_physical_size) &&
> > > + (addr + SZ_4K >= vr->dpa_base))
> > > + return vr;
> > > + }
> > > + return NULL;
> > > +}
> > > +
> > > +/**
> > > + * xe_ttm_vram_handle_addr_fault - Handle flagged VRAM physical address error
> > > + * @xe: pointer to parent device
> > > + * @addr: physical faulty address
> > > + *
> > > + * Handle the physical faulty address error on the specific tile.
> > > + *
> > > + * Returns 0 for success, negative error code otherwise.
> > > + */
> > > +int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
> > > +{
> > > + struct xe_ttm_vram_mgr *vram_mgr;
> > > + struct xe_vram_region *vr;
> > > + struct gpu_buddy *mm;
> > > + int ret;
> > > +
> > > + vr = xe_ttm_vram_addr_to_region(xe, addr);
> > > + if (!vr) {
> > > + drm_err(&xe->drm, "%s:%d addr:%lx error requesting SBR\n",
> > > + __func__, __LINE__, addr);
> > > + /* Hint System controller driver for reset with -EIO */
> > > + return -EIO;
> > > + }
> > > + vram_mgr = &vr->ttm;
> > > + mm = &vram_mgr->mm;
> > > + /* Reserve page at address */
> > > + ret = xe_ttm_vram_reserve_page_at_addr(xe, addr, vram_mgr, mm);
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL(xe_ttm_vram_handle_addr_fault);
> > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > index 87b7fae5edba..8ef06d9d44f7 100644
> > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > @@ -31,6 +31,7 @@ u64 xe_ttm_vram_get_cpu_visible_size(struct ttm_resource_manager *man);
> > >  void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
> > > u64 *used, u64 *used_visible);
> > >
> > > +int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr);
> > > static inline struct xe_ttm_vram_mgr_resource *
> > > to_xe_ttm_vram_mgr_resource(struct ttm_resource *res) {
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> > > index 9106da056b49..94eaf9d875f1 100644
> > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> > > @@ -19,6 +19,14 @@ struct xe_ttm_vram_mgr {
> > > struct ttm_resource_manager manager;
> > > /** @mm: DRM buddy allocator which manages the VRAM */
> > > struct gpu_buddy mm;
> > > + /** @offlined_pages: List of offlined pages */
> > > + struct list_head offlined_pages;
> > > + /** @n_offlined_pages: Number of offlined pages */
> > > + u16 n_offlined_pages;
> > > + /** @queued_pages: List of queued pages */
> > > + struct list_head queued_pages;
> > > + /** @n_queued_pages: Number of queued pages */
> > > + u16 n_queued_pages;
> > > /** @visible_size: Proped size of the CPU visible portion */
> > > u64 visible_size;
> > > /** @visible_avail: CPU visible portion still unallocated */
> > >
> > > @@ -45,4 +53,22 @@ struct xe_ttm_vram_mgr_resource {
> > > unsigned long flags;
> > > };
> > >
> > > +/**
> > > + * struct xe_ttm_vram_offline_resource - Xe TTM VRAM offline resource
> > > + */
> > > +struct xe_ttm_vram_offline_resource {
> > > + /** @offlined_link: Link to offlined pages */
> > > + struct list_head offlined_link;
> > > + /** @queued_link: Link to queued pages */
> > > + struct list_head queued_link;
> > > + /** @blocks: list of DRM buddy blocks */
> > > + struct list_head blocks;
> > > +	/** @used_visible_size: How many CPU visible bytes this resource is using */
> > > + u64 used_visible_size;
> > > + /** @id: The id of an offline resource */
> > > + u16 id;
> > > + /** @status: reservation status of resource */
> > > + bool status;
> > > +};
> > > +
> > > #endif
> > > --
> > > 2.52.0
> > >
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy
2026-04-02 9:10 ` Upadhyay, Tejas
@ 2026-04-02 20:50 ` Matthew Brost
2026-04-06 11:04 ` Upadhyay, Tejas
0 siblings, 1 reply; 24+ messages in thread
From: Matthew Brost @ 2026-04-02 20:50 UTC (permalink / raw)
To: Upadhyay, Tejas
Cc: intel-xe@lists.freedesktop.org, Auld, Matthew,
thomas.hellstrom@linux.intel.com, Ghimiray, Himal Prasad
On Thu, Apr 02, 2026 at 03:10:02AM -0600, Upadhyay, Tejas wrote:
>
>
> > -----Original Message-----
> > From: Brost, Matthew <matthew.brost@intel.com>
> > Sent: 02 April 2026 05:27
> > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Auld, Matthew
> > <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com; Ghimiray,
> > Himal Prasad <himal.prasad.ghimiray@intel.com>
> > Subject: Re: [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy
> >
> > On Fri, Mar 27, 2026 at 05:18:14PM +0530, Tejas Upadhyay wrote:
> > > Setup to link TTM buffer object inside gpu buddy. This functionality
> > > is critical for supporting the memory page offline feature on CRI,
> > > where identified faulty pages must be traced back to their originating
> > > buffer for safe removal.
> > >
> >
> > I just checked the SVM code is still setting 'block->private'.
> >
> > I thought we had patch for that but I don't see it in this series.
>
> Yeah, you are right, it was there when I sent the earlier revision; I somehow missed adding that patch to this series. I will add it back in the next revision. Sorry for the confusion here.
>
Also feel free to send that as an independent patch, as we should be able to merge
that one whenever.
Another comment.
> Tejas
> >
> > Matt
> >
> > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > index 5fd0d5506a7e..c627dbf94552 100644
> > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > @@ -54,6 +54,7 @@ static int xe_ttm_vram_mgr_new(struct
> > ttm_resource_manager *man,
> > > struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
> > > struct xe_ttm_vram_mgr_resource *vres;
> > > struct gpu_buddy *mm = &mgr->mm;
> > > + struct gpu_buddy_block *block;
> > > u64 size, min_page_size;
> > > unsigned long lpfn;
> > > int err;
> > > @@ -138,6 +139,8 @@ static int xe_ttm_vram_mgr_new(struct
> > ttm_resource_manager *man,
> > > }
> > >
> > > mgr->visible_avail -= vres->used_visible_size;
> > > + list_for_each_entry(block, &vres->blocks, link)
> > > + block->private = tbo;
Should we clear block->private in xe_ttm_vram_mgr_del() to avoid blocks
having stale references to BOs? Or are there guarantees elsewhere in the
code that a free block cannot reference block->private as a BO if it
encounters an error while in a free state? For safety and clarity, even
if such guarantees exist, it may still make sense to clear
block->private in xe_ttm_vram_mgr_del().
Matt
> > > mutex_unlock(&mgr->lock);
> > >
> > > if (!(vres->base.placement & TTM_PL_FLAG_CONTIGUOUS) &&
> > > --
> > > 2.52.0
> > >
^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy
2026-04-02 20:50 ` Matthew Brost
@ 2026-04-06 11:04 ` Upadhyay, Tejas
0 siblings, 0 replies; 24+ messages in thread
From: Upadhyay, Tejas @ 2026-04-06 11:04 UTC (permalink / raw)
To: Brost, Matthew
Cc: intel-xe@lists.freedesktop.org, Auld, Matthew,
thomas.hellstrom@linux.intel.com, Ghimiray, Himal Prasad
> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: 03 April 2026 02:20
> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> Cc: intel-xe@lists.freedesktop.org; Auld, Matthew
> <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com; Ghimiray,
> Himal Prasad <himal.prasad.ghimiray@intel.com>
> Subject: Re: [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy
>
> On Thu, Apr 02, 2026 at 03:10:02AM -0600, Upadhyay, Tejas wrote:
> >
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: 02 April 2026 05:27
> > > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Auld, Matthew
> > > <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com;
> > > Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>
> > > Subject: Re: [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu
> > > buddy
> > >
> > > On Fri, Mar 27, 2026 at 05:18:14PM +0530, Tejas Upadhyay wrote:
> > > > Setup to link TTM buffer object inside gpu buddy. This
> > > > functionality is critical for supporting the memory page offline
> > > > feature on CRI, where identified faulty pages must be traced back
> > > > to their originating buffer for safe removal.
> > > >
> > >
> > > I just checked the SVM code is still setting 'block->private'.
> > >
> > > I thought we had patch for that but I don't see it in this series.
> >
> > Yeah, you are right, it was there when I sent earlier revision, I somehow
> missed adding that patch in this series. I will add it back in next revision. Sorry
> for confusion here.
> >
>
> Also feel free to send that as an independent patch, as we should be able to merge that
> one whenever.
>
> Another comment.
>
> > Tejas
> > >
> > > Matt
> > >
> > > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > > ---
> > > > drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 3 +++
> > > > 1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > index 5fd0d5506a7e..c627dbf94552 100644
> > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > @@ -54,6 +54,7 @@ static int xe_ttm_vram_mgr_new(struct
> > > ttm_resource_manager *man,
> > > > struct xe_ttm_vram_mgr *mgr = to_xe_ttm_vram_mgr(man);
> > > > struct xe_ttm_vram_mgr_resource *vres;
> > > > struct gpu_buddy *mm = &mgr->mm;
> > > > + struct gpu_buddy_block *block;
> > > > u64 size, min_page_size;
> > > > unsigned long lpfn;
> > > > int err;
> > > > @@ -138,6 +139,8 @@ static int xe_ttm_vram_mgr_new(struct
> > > ttm_resource_manager *man,
> > > > }
> > > >
> > > > mgr->visible_avail -= vres->used_visible_size;
> > > > + list_for_each_entry(block, &vres->blocks, link)
> > > > + block->private = tbo;
>
> Should we clear block->private in xe_ttm_vram_mgr_del() to avoid blocks
> having stale references to BOs? Or are there guarantees elsewhere in the code
> that a free block cannot reference block->private as a BO if it encounters an
> error while in a free state? For safety and clarity, even if such guarantees exist,
> it may still make sense to clear
> block->private in xe_ttm_vram_mgr_del().
Yes, there is handling for clearing block->private right after reserving via gpu_buddy_alloc_blocks() in xe_ttm_vram_reserve_page_at_addr(). But I think there is no harm in clearing it in xe_ttm_vram_mgr_del() as well.
Tejas
>
> Matt
>
> > > > mutex_unlock(&mgr->lock);
> > > >
> > > > if (!(vres->base.placement & TTM_PL_FLAG_CONTIGUOUS) &&
> > > > --
> > > > 2.52.0
> > > >
^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [RFC PATCH V6 3/7] drm/xe: Handle physical memory address error
2026-04-02 20:20 ` Matthew Brost
@ 2026-04-07 12:03 ` Upadhyay, Tejas
0 siblings, 0 replies; 24+ messages in thread
From: Upadhyay, Tejas @ 2026-04-07 12:03 UTC (permalink / raw)
To: Brost, Matthew
Cc: Aravind Iddamsetty, intel-xe@lists.freedesktop.org, Auld, Matthew,
thomas.hellstrom@linux.intel.com, Ghimiray, Himal Prasad
> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: 03 April 2026 01:50
> To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>; intel-
> xe@lists.freedesktop.org; Auld, Matthew <matthew.auld@intel.com>;
> thomas.hellstrom@linux.intel.com; Ghimiray, Himal Prasad
> <himal.prasad.ghimiray@intel.com>
> Subject: Re: [RFC PATCH V6 3/7] drm/xe: Handle physical memory address
> error
>
> On Thu, Apr 02, 2026 at 04:30:47AM -0600, Upadhyay, Tejas wrote:
> >
> >
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: 02 April 2026 06:34
> > > To: Upadhyay, Tejas <tejas.upadhyay@intel.com>
> > > Cc: intel-xe@lists.freedesktop.org; Auld, Matthew
> > > <matthew.auld@intel.com>; thomas.hellstrom@linux.intel.com;
> > > Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>
> > > Subject: Re: [RFC PATCH V6 3/7] drm/xe: Handle physical memory
> > > address error
> > >
> > > On Fri, Mar 27, 2026 at 05:18:16PM +0530, Tejas Upadhyay wrote:
> > > > This functionality represents a significant step in making the xe
> > > > driver gracefully handle hardware memory degradation.
> > > > By integrating with the DRM Buddy allocator, the driver can
> > > > permanently "carve out" faulty memory so it isn't reused by
> > > > subsequent allocations.
> > > >
> > > > Buddy Block Reservation:
> > > > ----------------------
> > > > When a memory address is reported as faulty, the driver instructs
> > > > the DRM Buddy allocator to reserve a block of the specific page
> > > > size (typically 4KB). This marks the memory as "dirty/used"
> > > > indefinitely.
> > > >
> > > > Two-Stage Tracking:
> > > > -----------------
> > > > Offlined Pages:
> > > > Pages that have been successfully isolated and removed from the
> > > > available memory pool.
> > > >
> > > > Queued Pages:
> > > > Addresses that have been flagged as faulty but are currently in
> > > > use by a process. These are tracked until the associated buffer
> > > > object (BO) is released or migrated, at which point they move to the
> > > > "offlined" state.
> > > >
> > > > Sysfs Reporting:
> > > > --------------
> > > > The patch exposes these metrics through a standard interface,
> > > > allowing administrators to monitor VRAM health:
> > > > /sys/bus/pci/devices/<device_id>/vram_bad_bad_pages
> > > >
> > > > V5:
> > > > - Categorise and handle BOs accordingly
> > > > - Fix crash found with new debugfs tests
> > > > V4:
> > > > - Set block->private NULL post bo purge
> > > > - Filter out gsm address early on
> > > > - Rebase
> > > > V3:
> > > > -rename api, remove tile dependency and add status of reservation
> > > > V2:
> > > > - Fix mm->avail counter issue
> > > > - Remove unused code and handle clean up in case of error
> > > >
> > > > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > > ---
> > > > drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 336 +++++++++++++++++++
> > > > drivers/gpu/drm/xe/xe_ttm_vram_mgr.h | 1 +
> > > > drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h | 26 ++
> > > > 3 files changed, 363 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > index c627dbf94552..0fec7b332501 100644
> > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > > > @@ -13,7 +13,10 @@
> > > >
> > > > #include "xe_bo.h"
> > > > #include "xe_device.h"
> > > > +#include "xe_exec_queue.h"
> > > > +#include "xe_lrc.h"
> > > > #include "xe_res_cursor.h"
> > > > +#include "xe_ttm_stolen_mgr.h"
> > > > #include "xe_ttm_vram_mgr.h"
> > > > #include "xe_vram_types.h"
> > > >
> > > > @@ -277,6 +280,26 @@ static const struct ttm_resource_manager_func xe_ttm_vram_mgr_func = {
> > > > .debug = xe_ttm_vram_mgr_debug
> > > > };
> > > >
> > > > +static void xe_ttm_vram_free_bad_pages(struct drm_device *dev,
> > > > +				       struct xe_ttm_vram_mgr *mgr)
> > > > +{
> > > > + struct xe_ttm_vram_offline_resource *pos, *n;
> > > > +
> > > > + mutex_lock(&mgr->lock);
> > > > +	list_for_each_entry_safe(pos, n, &mgr->offlined_pages, offlined_link) {
> > > > + --mgr->n_offlined_pages;
> > > > + gpu_buddy_free_list(&mgr->mm, &pos->blocks, 0);
> > > > + mgr->visible_avail += pos->used_visible_size;
> > > > + list_del(&pos->offlined_link);
> > > > + kfree(pos);
> > > > + }
> > > > +	list_for_each_entry_safe(pos, n, &mgr->queued_pages, queued_link) {
> > > > + list_del(&pos->queued_link);
> > > > + mgr->n_queued_pages--;
> > > > + kfree(pos);
> > > > + }
> > > > + mutex_unlock(&mgr->lock);
> > > > +}
> > > > +
> > > > static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
> > > > {
> > > > 	struct xe_device *xe = to_xe_device(dev);
> > > > @@ -288,6 +311,8 @@ static void xe_ttm_vram_mgr_fini(struct drm_device *dev, void *arg)
> > > > 	if (ttm_resource_manager_evict_all(&xe->ttm, man))
> > > > 		return;
> > > >
> > > > + xe_ttm_vram_free_bad_pages(dev, mgr);
> > > > +
> > > > WARN_ON_ONCE(mgr->visible_avail != mgr->visible_size);
> > > >
> > > > gpu_buddy_fini(&mgr->mm);
> > > > @@ -316,6 +341,8 @@ int __xe_ttm_vram_mgr_init(struct xe_device *xe, struct xe_ttm_vram_mgr *mgr,
> > > > man->func = &xe_ttm_vram_mgr_func;
> > > > mgr->mem_type = mem_type;
> > > > mutex_init(&mgr->lock);
> > > > + INIT_LIST_HEAD(&mgr->offlined_pages);
> > > > + INIT_LIST_HEAD(&mgr->queued_pages);
> > > > mgr->default_page_size = default_page_size;
> > > > mgr->visible_size = io_size;
> > > > mgr->visible_avail = io_size;
> > > > @@ -471,3 +498,312 @@ u64 xe_ttm_vram_get_avail(struct ttm_resource_manager *man)
> > > >
> > > > return avail;
> > > > }
> > > > +
> > > > +static bool is_ttm_vram_migrate_lrc(struct xe_device *xe, struct xe_bo *pbo)
> > >
> > > As discussed in prior reply [1] - I think this can be dropped.
> > >
> > > [1]
> > >
> > > https://patchwork.freedesktop.org/patch/714756/?series=161473&rev=6#comment_1318048
> > >
> > > > +{
> > > > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > > > + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> > > > + (pbo->flags & (XE_BO_FLAG_GGTT |
> > > XE_BO_FLAG_GGTT_INVALIDATE)) &&
> > > > + !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
> > > > + unsigned long idx;
> > > > + struct xe_exec_queue *q;
> > > > + struct drm_device *dev = &xe->drm;
> > > > + struct drm_file *file;
> > > > + struct xe_lrc *lrc;
> > > > +
> > > > + /* TODO : Need to extend to multitile in future if needed */
> > > > + mutex_lock(&dev->filelist_mutex);
> > > > + list_for_each_entry(file, &dev->filelist, lhead) {
> > > > + struct xe_file *xef = file->driver_priv;
> > > > +
> > > > + mutex_lock(&xef->exec_queue.lock);
> > > > + xa_for_each(&xef->exec_queue.xa, idx, q) {
> > > > + xe_exec_queue_get(q);
> > > > + mutex_unlock(&xef->exec_queue.lock);
> > > > +
> > > > + for (int i = 0; i < q->width; i++) {
> > > > + lrc = xe_exec_queue_get_lrc(q, i);
> > > > + if (lrc->bo == pbo) {
> > > > + xe_lrc_put(lrc);
> > > > +					mutex_lock(&xef->exec_queue.lock);
> > > > +					xe_exec_queue_put(q);
> > > > +					mutex_unlock(&xef->exec_queue.lock);
> > > > +					mutex_unlock(&dev->filelist_mutex);
> > > > + return false;
> > > > + }
> > > > + xe_lrc_put(lrc);
> > > > + }
> > > > + mutex_lock(&xef->exec_queue.lock);
> > > > + xe_exec_queue_put(q);
> > > > + mutex_unlock(&xef->exec_queue.lock);
> > > > + }
> > > > + }
> > > > + mutex_unlock(&dev->filelist_mutex);
> > > > + return true;
> > > > + }
> > > > + return false;
> > > > +}
> > > > +
> > > > +static void xe_ttm_vram_purge_page(struct xe_device *xe, struct
> > > > +xe_bo
> > > > +*pbo) {
> > > > + struct ttm_placement place = {};
> > > > + struct ttm_operation_ctx ctx = {
> > > > + .interruptible = false,
> > > > + .gfp_retry_mayfail = false,
> > > > + };
> > > > + bool locked;
> > > > + int ret = 0;
> > > > +
> > > > + /* Ban VM if BO is PPGTT */
> > > > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > > > + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> > > > + pbo->flags & XE_BO_FLAG_PAGETABLE) {
> > >
> > > I think XE_BO_FLAG_PAGETABLE and XE_BO_FLAG_FORCE_USER_VRAM are
> > > sufficient here.
> > >
> > > Also, if XE_BO_FLAG_PAGETABLE is set but XE_BO_FLAG_FORCE_USER_VRAM
> > > is clear, that means this is a kernel VM and we probably have to
> > > wedge the device, right?
> >
> > I am looking at all the other review comments; meanwhile, as a quick response,
> > @Aravind Iddamsetty was suggesting to do an SBR reset for critical BOs in place
> > of wedge.
> >
>
> That very well could be correct — it would have to be some kind of global
> event if a critical BO encounters an error and the driver needs to recover. Part
> of the recovery process has to be replacing the critical BO with a new BO too, right?
Had a discussion with Aravind on this. An SBR reset will be like a fresh memory init next time, so it will be a fresh start; on fresh memory init, based on the policy, we will reserve that page early on, as it will be part of the "to be offlined" pages fetched from FW. So in that case we do not need to replace it with a new BO.
>
> A few more comments below.
>
> > Tejas
> > >
> > > > + down_write(&pbo->vm->lock);
> > > > + xe_vm_kill(pbo->vm, true);
> > > > + up_write(&pbo->vm->lock);
> > > > + }
> > > > +
> > > > + /* Ban exec queue if BO is lrc */
> > > > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > > > + pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM &&
> > > > + (pbo->flags & (XE_BO_FLAG_GGTT |
> > > XE_BO_FLAG_GGTT_INVALIDATE)) &&
> > > > + !(pbo->flags & XE_BO_FLAG_PAGETABLE)) {
> > >
> > >
> > > This is a huge if statement just to determine whether this is an
> > > LRC. At a minimum, we’d need to normalize this, and it looks very
> > > fragile—if we change flags elsewhere in the driver, this if statement could
> easily break.
> > >
> > > Also, I can’t say I’m a fan of searching just to kill an individual queue.
> > >
> > > It’s a bit unfortunate that LRCs are created without a VM (I forget
> > > the exact reasoning, but I seem to recall it was related to
> > > multi-q?)
> > >
> > > I think what we really want to do is:
> > >
> > > - If we find a PT or LRC BO, kill the VM.
>
> If the PT encounters an error, we likely also want to immediately invalidate the
> VM’s page table structure to avoid a device page-table walk reading a
> corrupted PT. xe_vm_close() does this in the code below:
>
> 1828 if (bound) {
> 1829 for_each_tile(tile, xe, id)
> 1830 if (vm->pt_root[id])
> 1831 xe_pt_clear(xe, vm->pt_root[id]);
> 1832
> 1833 for_each_gt(gt, xe, id)
> 1834                         xe_tlb_inval_vm(&gt->tlb_inval, vm);
> 1835 }
>
> Maybe this could be extracted into a helper and called here, likely after
> xe_vm_kill(). There is also a weird corner case where the PT that encounters
> the error is vm->pt_root[id], which is particularly bad, because in that case we
> can’t call xe_pt_clear(). That operation involves a CPU write, and if I remember
> correctly, things go really badly then.
>
> > > - Update ‘kill VM’ to kill all exec queues. I honestly forget why we
> > > only kill preempt/rebind queues—it’s likely some nonsensical reasoning
> > > that we never cleaned up. We already have xe_vm_add_exec_queue(),
> which
> > > is short-circuited on xe->info.has_ctx_tlb_inval, but we can just
> > > remove that.
> > > - Normalize this with an LRC BO flag and store the user_vm in the BO for
> > > LRCs.
> > > - Critical kernel BOs normalized with BO flag -> wedge the device
> > >
> > > The difference between killing a queue and killing a VM doesn’t
> > > really matter from a user-space point of view, since typically a
> > > single-queue hang leads to the entire process crashing or
> > > restarting—at least for Mesa 3D. We should confirm with compute
> > > whether this is also what we’re targeting for CRI, but I suspect the
> > > answer is the same. Even if it isn’t, I’m not convinced per-queue
> > > killing is worthwhile. And if we decide it is, the filelist /
> > > exec_queue.xa search is pretty much a non-starter for me—for example,
> > > we’d need to make this much simpler and avoid taking a bunch of locks here,
> > > which looks pretty scary.
> > >
> > > > + struct drm_device *dev = &xe->drm;
> > > > + struct xe_exec_queue *q;
> > > > + struct drm_file *file;
> > > > + struct xe_lrc *lrc;
> > > > + unsigned long idx;
> > > > +
> > > > + /* TODO : Need to extend to multitile in future if needed */
> > > > + mutex_lock(&dev->filelist_mutex);
> > > > + list_for_each_entry(file, &dev->filelist, lhead) {
> > > > + struct xe_file *xef = file->driver_priv;
> > > > +
> > > > + mutex_lock(&xef->exec_queue.lock);
> > > > + xa_for_each(&xef->exec_queue.xa, idx, q) {
> > > > + xe_exec_queue_get(q);
> > > > + mutex_unlock(&xef->exec_queue.lock);
> > > > +
> > > > + for (int i = 0; i < q->width; i++) {
> > > > + lrc = xe_exec_queue_get_lrc(q, i);
> > > > + if (lrc->bo == pbo) {
> > > > + xe_lrc_put(lrc);
> > > > + xe_exec_queue_kill(q);
> > > > + } else {
> > > > + xe_lrc_put(lrc);
> > > > + }
> > > > + }
> > > > +
> > > > + mutex_lock(&xef->exec_queue.lock);
> > > > + xe_exec_queue_put(q);
> > > > + mutex_unlock(&xef->exec_queue.lock);
> > > > + }
> > > > + }
> > > > + mutex_unlock(&dev->filelist_mutex);
> > > > + }
> > > > +
> > > > + spin_lock(&pbo->ttm.bdev->lru_lock);
> > > > + locked = dma_resv_trylock(pbo->ttm.base.resv);
> > > > + spin_unlock(&pbo->ttm.bdev->lru_lock);
> > > > + WARN_ON(!locked);
> > >
> > > Is there any reason why we can’t just take a sleeping dma_resv_lock here (e.g.
> > > xe_bo_lock)? Also, I think the trick with the LRU lock only works
> > > once the BO’s dma_resv has been individualized (kref == 0), which is
> > > clearly not the case here.
> > >
> > > > + ret = ttm_bo_validate(&pbo->ttm, &place, &ctx);
>
> I thought I had typed this out, but with the purging-BO series now merged, we
> have a helper for purging: xe_ttm_bo_purge(). I don’t think it works exactly for
> this case, but with small updates, I believe it could be made to work. It also
> does things like remove the page tables for the BO, which I think is desired.
Yeah, I saw that xe_ttm_bo_purge(), let me see how we can fit this case into it.
Tejas
>
> Matt
>
> > > > + drm_WARN_ON(&xe->drm, ret);
> > > > + xe_bo_put(pbo);
> > > > + if (locked)
> > > > + dma_resv_unlock(pbo->ttm.base.resv);
> > > > +}
> > > > +
> > > > +static int xe_ttm_vram_reserve_page_at_addr(struct xe_device *xe, unsigned long addr,
> > > > + struct xe_ttm_vram_mgr *vram_mgr, struct gpu_buddy *mm)
> > > > +{
> > > > + struct xe_ttm_vram_offline_resource *nentry;
> > > > + struct ttm_buffer_object *tbo = NULL;
> > > > + struct gpu_buddy_block *block;
> > > > + struct gpu_buddy_block *b, *m;
> > > > + enum reserve_status {
> > > > + pending = 0,
> > > > + fail
> > > > + };
> > > > + u64 size = SZ_4K;
> > > > + int ret = 0;
> > > > +
> > > > + mutex_lock(&vram_mgr->lock);
> > >
> > > You’re going to have to fix the locking here. For example, the lock
> > > is released inside nested if statements below, which makes this
> > > function very difficult to follow. Personally, I can’t really focus on anything
> else until this is cleaned up.
> > > I’m not saying we don’t already have bad locking patterns in Xe—I’m
> > > sure we do—but let’s avoid introducing new code with those patterns.
> > >
> > > For example, it should look more like this:
> > >
> > > mutex_lock(&vram_mgr->lock);
> > > /* Do the minimal work that requires the lock */
> > > mutex_unlock(&vram_mgr->lock);
> > >
> > > /* Do other work where &vram_mgr->lock needs to be dropped */
> > >
> > > mutex_lock(&vram_mgr->lock);
> > > /* Do more work that requires the lock */
> > > mutex_unlock(&vram_mgr->lock);
> > >
> > > Also strongly prefer guards or scoped_guards too.
> > >
> > > > + block = gpu_buddy_addr_to_block(mm, addr);
> > > > + if (PTR_ERR(block) == -ENXIO) {
> > > > + mutex_unlock(&vram_mgr->lock);
> > > > + return -ENXIO;
> > > > + }
> > > > +
> > > > + nentry = kzalloc_obj(*nentry);
> > > > + if (!nentry)
> > > > + return -ENOMEM;
> > > > + INIT_LIST_HEAD(&nentry->blocks);
> > > > + nentry->status = pending;
> > > > +
> > > > + if (block) {
> > > > + struct xe_ttm_vram_offline_resource *pos, *n;
> > > > + struct xe_bo *pbo;
> > > > +
> > > > + WARN_ON(!block->private);
> > > > + tbo = block->private;
> > > > + pbo = ttm_to_xe_bo(tbo);
> > > > +
> > > > + xe_bo_get(pbo);
> > >
> > > This probably needs a kref get if it’s non‑zero. If this is a zombie
> > > BO, it should already be getting destroyed. Also, we’re going to
> > > need to look into gutting the TTM pipeline as well, where TTM
> > > resources are transferred to different BOs— but there’s enough to clean up
> here first before we get to that.
> > >
> > > I'm going to stop here as there is quite a bit to cleanup / simplify
> > > before I can dig in more.
> > >
> > > Matt
> > >
> > > > + /* Critical kernel BO? */
> > > > + if (pbo->ttm.type == ttm_bo_type_kernel &&
> > > > + (!(pbo->flags & XE_BO_FLAG_FORCE_USER_VRAM) ||
> > > > + is_ttm_vram_migrate_lrc(xe, pbo))) {
> > > > + mutex_unlock(&vram_mgr->lock);
> > > > + kfree(nentry);
> > > > + xe_ttm_vram_free_bad_pages(&xe->drm, vram_mgr);
> > > > + xe_bo_put(pbo);
> > > > + drm_err(&xe->drm,
> > > > + "%s: corrupt addr: 0x%lx in critical kernel bo, request reset\n",
> > > > + __func__, addr);
> > > > + /* Hint System controller driver for reset with -EIO */
> > > > + return -EIO;
> > > > + }
> > > > + nentry->id = ++vram_mgr->n_queued_pages;
> > > > + list_add(&nentry->queued_link, &vram_mgr->queued_pages);
> > > > + mutex_unlock(&vram_mgr->lock);
> > > > +
> > > > + /* Purge BO containing address */
> > > > + xe_ttm_vram_purge_page(xe, pbo);
> > > > +
> > > > + /* Reserve page at address addr */
> > > > + mutex_lock(&vram_mgr->lock);
> > > > + ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
> > > > + size, size, &nentry->blocks,
> > > > + GPU_BUDDY_RANGE_ALLOCATION);
> > > > +
> > > > + if (ret) {
> > > > + drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
> > > > + addr, ret);
> > > > + nentry->status = fail;
> > > > + mutex_unlock(&vram_mgr->lock);
> > > > + return ret;
> > > > + }
> > > > +
> > > > + list_for_each_entry_safe(b, m, &nentry->blocks, link)
> > > > + b->private = NULL;
> > > > +
> > > > + if ((addr + size) <= vram_mgr->visible_size) {
> > > > + nentry->used_visible_size = size;
> > > > + } else {
> > > > + list_for_each_entry(b, &nentry->blocks, link) {
> > > > + u64 start = gpu_buddy_block_offset(b);
> > > > +
> > > > + if (start < vram_mgr->visible_size) {
> > > > + u64 end = start + gpu_buddy_block_size(mm, b);
> > > > +
> > > > + nentry->used_visible_size +=
> > > > + min(end, vram_mgr->visible_size) - start;
> > > > + }
> > > > + }
> > > > + }
> > > > + vram_mgr->visible_avail -= nentry->used_visible_size;
> > > > + list_for_each_entry_safe(pos, n, &vram_mgr->queued_pages, queued_link) {
> > > > + if (pos->id == nentry->id) {
> > > > + --vram_mgr->n_queued_pages;
> > > > + list_del(&pos->queued_link);
> > > > + break;
> > > > + }
> > > > + }
> > > > + list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
> > > > + /* TODO: FW Integration: Send command to FW for offlining page */
> > > > + ++vram_mgr->n_offlined_pages;
> > > > + mutex_unlock(&vram_mgr->lock);
> > > > + return ret;
> > > > +
> > > > + } else {
> > > > + ret = gpu_buddy_alloc_blocks(mm, addr, addr + size,
> > > > + size, size, &nentry->blocks,
> > > > + GPU_BUDDY_RANGE_ALLOCATION);
> > > > + if (ret) {
> > > > + drm_warn(&xe->drm, "Could not reserve page at addr:0x%lx, ret:%d\n",
> > > > + addr, ret);
> > > > + nentry->status = fail;
> > > > + mutex_unlock(&vram_mgr->lock);
> > > > + return ret;
> > > > + }
> > > > +
> > > > + list_for_each_entry_safe(b, m, &nentry->blocks, link)
> > > > + b->private = NULL;
> > > > +
> > > > + if ((addr + size) <= vram_mgr->visible_size) {
> > > > + nentry->used_visible_size = size;
> > > > + } else {
> > > > + struct gpu_buddy_block *block;
> > > > +
> > > > + list_for_each_entry(block, &nentry->blocks, link) {
> > > > + u64 start = gpu_buddy_block_offset(block);
> > > > +
> > > > + if (start < vram_mgr->visible_size) {
> > > > + u64 end = start + gpu_buddy_block_size(mm, block);
> > > > +
> > > > + nentry->used_visible_size +=
> > > > + min(end, vram_mgr->visible_size) - start;
> > > > + }
> > > > + }
> > > > + }
> > > > + vram_mgr->visible_avail -= nentry->used_visible_size;
> > > > + nentry->id = ++vram_mgr->n_offlined_pages;
> > > > + list_add(&nentry->offlined_link, &vram_mgr->offlined_pages);
> > > > + /* TODO: FW Integration: Send command to FW for offlining page */
> > > > + mutex_unlock(&vram_mgr->lock);
> > > > + }
> > > > + /* Success */
> > > > + return ret;
> > > > +}
> > > > +
> > > > +static struct xe_vram_region *xe_ttm_vram_addr_to_region(struct xe_device *xe,
> > > > + resource_size_t addr)
> > > > +{
> > > > + unsigned long stolen_base = xe_ttm_stolen_gpu_offset(xe);
> > > > + struct xe_vram_region *vr;
> > > > + struct xe_tile *tile;
> > > > + int id;
> > > > +
> > > > + /* Addr from stolen memory? */
> > > > + if (addr + SZ_4K >= stolen_base)
> > > > + return NULL;
> > > > +
> > > > + for_each_tile(tile, xe, id) {
> > > > + vr = tile->mem.vram;
> > > > + if ((addr <= vr->dpa_base + vr->actual_physical_size) &&
> > > > + (addr + SZ_4K >= vr->dpa_base))
> > > > + return vr;
> > > > + }
> > > > + return NULL;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_ttm_vram_handle_addr_fault - Handle flagged vram physical address error
> > > > + * @xe: pointer to parent device
> > > > + * @addr: physical faulty address
> > > > + *
> > > > + * Handle the physical faulty address error on the specific tile.
> > > > + *
> > > > + * Returns 0 for success, negative error code otherwise.
> > > > + */
> > > > +int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr)
> > > > +{
> > > > + struct xe_ttm_vram_mgr *vram_mgr;
> > > > + struct xe_vram_region *vr;
> > > > + struct gpu_buddy *mm;
> > > > + int ret;
> > > > +
> > > > + vr = xe_ttm_vram_addr_to_region(xe, addr);
> > > > + if (!vr) {
> > > > + drm_err(&xe->drm, "%s:%d addr:%lx error requesting SBR\n",
> > > > + __func__, __LINE__, addr);
> > > > + /* Hint System controller driver for reset with -EIO */
> > > > + return -EIO;
> > > > + }
> > > > + vram_mgr = &vr->ttm;
> > > > + mm = &vram_mgr->mm;
> > > > + /* Reserve page at address */
> > > > + ret = xe_ttm_vram_reserve_page_at_addr(xe, addr, vram_mgr, mm);
> > > > + return ret;
> > > > +}
> > > > +EXPORT_SYMBOL(xe_ttm_vram_handle_addr_fault);
> > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > index 87b7fae5edba..8ef06d9d44f7 100644
> > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.h
> > > > @@ -31,6 +31,7 @@ u64 xe_ttm_vram_get_cpu_visible_size(struct ttm_resource_manager *man);
> > > > void xe_ttm_vram_get_used(struct ttm_resource_manager *man,
> > > > u64 *used, u64 *used_visible);
> > > >
> > > > +int xe_ttm_vram_handle_addr_fault(struct xe_device *xe, unsigned long addr);
> > > > static inline struct xe_ttm_vram_mgr_resource *
> > > > to_xe_ttm_vram_mgr_resource(struct ttm_resource *res) {
> > > >
> > > > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> > > > index 9106da056b49..94eaf9d875f1 100644
> > > > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr_types.h
> > > > @@ -19,6 +19,14 @@ struct xe_ttm_vram_mgr {
> > > > struct ttm_resource_manager manager;
> > > > /** @mm: DRM buddy allocator which manages the VRAM */
> > > > struct gpu_buddy mm;
> > > > + /** @offlined_pages: List of offlined pages */
> > > > + struct list_head offlined_pages;
> > > > + /** @n_offlined_pages: Number of offlined pages */
> > > > + u16 n_offlined_pages;
> > > > + /** @queued_pages: List of queued pages */
> > > > + struct list_head queued_pages;
> > > > + /** @n_queued_pages: Number of queued pages */
> > > > + u16 n_queued_pages;
> > > > /** @visible_size: Proped size of the CPU visible portion */
> > > > u64 visible_size;
> > > > /** @visible_avail: CPU visible portion still unallocated */
> > > > @@ -45,4 +53,22 @@ struct xe_ttm_vram_mgr_resource {
> > > > unsigned long flags;
> > > > };
> > > >
> > > > +/**
> > > > + * struct xe_ttm_vram_offline_resource - Xe TTM VRAM offline resource
> > > > + */
> > > > +struct xe_ttm_vram_offline_resource {
> > > > + /** @offlined_link: Link to offlined pages */
> > > > + struct list_head offlined_link;
> > > > + /** @queued_link: Link to queued pages */
> > > > + struct list_head queued_link;
> > > > + /** @blocks: list of DRM buddy blocks */
> > > > + struct list_head blocks;
> > > > + /** @used_visible_size: How many CPU visible bytes this resource is using */
> > > > + u64 used_visible_size;
> > > > + /** @id: The id of an offline resource */
> > > > + u16 id;
> > > > + /** @status: reservation status of resource */
> > > > + bool status;
> > > > +};
> > > > +
> > > > #endif
> > > > --
> > > > 2.52.0
> > > >
Thread overview: 24+ messages
2026-03-27 11:48 [RFC PATCH V6 0/7] Add memory page offlining support Tejas Upadhyay
2026-03-27 11:48 ` [RFC PATCH V6 1/7] drm/xe: Link VRAM object with gpu buddy Tejas Upadhyay
2026-04-01 23:56 ` Matthew Brost
2026-04-02 9:10 ` Upadhyay, Tejas
2026-04-02 20:50 ` Matthew Brost
2026-04-06 11:04 ` Upadhyay, Tejas
2026-03-27 11:48 ` [RFC PATCH V6 2/7] drm/gpu: Add gpu_buddy_addr_to_block helper Tejas Upadhyay
2026-04-02 0:09 ` Matthew Brost
2026-04-02 10:16 ` Matthew Auld
2026-04-02 9:12 ` Matthew Auld
2026-03-27 11:48 ` [RFC PATCH V6 3/7] drm/xe: Handle physical memory address error Tejas Upadhyay
2026-04-01 23:53 ` Matthew Brost
2026-04-02 1:03 ` Matthew Brost
2026-04-02 10:30 ` Upadhyay, Tejas
2026-04-02 20:20 ` Matthew Brost
2026-04-07 12:03 ` Upadhyay, Tejas
2026-03-27 11:48 ` [RFC PATCH V6 4/7] drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
2026-03-27 11:48 ` [RFC PATCH V6 5/7] gpu/buddy: Add routine to dump allocated buddy blocks Tejas Upadhyay
2026-03-27 11:48 ` [RFC PATCH V6 6/7] drm/xe/configfs: Add vram bad page reservation policy Tejas Upadhyay
2026-03-27 11:48 ` [RFC PATCH V6 7/7] drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
2026-03-27 12:24 ` ✗ CI.checkpatch: warning for Add memory page offlining support (rev6) Patchwork
2026-03-27 12:26 ` ✓ CI.KUnit: success " Patchwork
2026-03-27 13:16 ` ✓ Xe.CI.BAT: " Patchwork
2026-03-28 4:49 ` ✓ Xe.CI.FULL: " Patchwork