* [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch
@ 2025-08-06  6:22 Matthew Brost
  2025-08-06  6:22 ` [PATCH 01/11] drm/xe: Stub out new pagefault layer Matthew Brost
                   ` (12 more replies)
  0 siblings, 13 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

We likely need multiple page fault producers feeding into a common
consumer backend. Additionally, our current page fault work queue
design—being per-GT rather than per-device—makes little sense. Clean
this up ahead of upcoming changes that introduce fine-grained fault
locking and threaded prefetching.

Fine-grained fault locking provides immediate benefits: it allows page
faults from the same VM to be processed in parallel (unless they target
the same range) and enables a sane multi-threaded prefetch
implementation. Longer term, it should help transition GPU SVM from a
per-VM model to a per-MM model, which scales better across multiple VMs
or devices.

Lastly, threaded prefetching enables efficient usage of copy engines.
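
For orientation, the rough flow once the series lands (names taken from
the patches below):

  G2H fault (GuC, producer)
    -> xe_guc_pagefault_handler(): parse the G2H message into a
       struct xe_pagefault
    -> xe_pagefault_handler(): copy the fault into
       xe->usm.pf_queue[asid % XE_PAGEFAULT_QUEUE_COUNT], queue a worker
    -> xe_pagefault_queue_work(): service the fault, then call
       pf->producer.ops->ack_fault(pf, err) to reply to the GuC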

Matt   

Matthew Brost (11):
  drm/xe: Stub out new pagefault layer
  drm/xe: Implement xe_pagefault_init
  drm/xe: Implement xe_pagefault_reset
  drm/xe: Implement xe_pagefault_handler
  drm/xe: Implement xe_pagefault_queue_work
  drm/xe: Add xe_guc_pagefault layer
  drm/xe: Remove unused GT page fault code
  drm/xe: Fine grained page fault locking
  drm/xe: Allow prefetch-only VM bind IOCTLs to use VM read lock
  drm/xe: Thread prefetch of SVM ranges
  drm/xe: Add num_pf_queue modparam

 drivers/gpu/drm/xe/Makefile             |   3 +-
 drivers/gpu/drm/xe/xe_device.c          |  20 +-
 drivers/gpu/drm/xe/xe_device_types.h    |  10 +
 drivers/gpu/drm/xe/xe_gt.c              |   8 +-
 drivers/gpu/drm/xe/xe_gt_pagefault.c    | 691 ------------------------
 drivers/gpu/drm/xe/xe_gt_pagefault.h    |  19 -
 drivers/gpu/drm/xe/xe_gt_types.h        |  65 ---
 drivers/gpu/drm/xe/xe_guc_ct.c          |   6 +-
 drivers/gpu/drm/xe/xe_guc_pagefault.c   |  94 ++++
 drivers/gpu/drm/xe/xe_guc_pagefault.h   |  13 +
 drivers/gpu/drm/xe/xe_hmm.c             |   4 +-
 drivers/gpu/drm/xe/xe_module.c          |   5 +
 drivers/gpu/drm/xe/xe_module.h          |   1 +
 drivers/gpu/drm/xe/xe_pagefault.c       | 480 ++++++++++++++++
 drivers/gpu/drm/xe/xe_pagefault.h       |  19 +
 drivers/gpu/drm/xe/xe_pagefault_types.h | 125 +++++
 drivers/gpu/drm/xe/xe_svm.c             | 114 ++--
 drivers/gpu/drm/xe/xe_svm.h             |  38 ++
 drivers/gpu/drm/xe/xe_vm.c              | 261 +++++++--
 drivers/gpu/drm/xe/xe_vm.h              |   2 +
 drivers/gpu/drm/xe/xe_vm_types.h        |  28 +-
 21 files changed, 1131 insertions(+), 875 deletions(-)
 delete mode 100644 drivers/gpu/drm/xe/xe_gt_pagefault.c
 delete mode 100644 drivers/gpu/drm/xe/xe_gt_pagefault.h
 create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.c
 create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.h
 create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
 create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
 create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h

-- 
2.34.1


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
@ 2025-08-06  6:22 ` Matthew Brost
  2025-08-06 23:01   ` Summers, Stuart
                     ` (2 more replies)
  2025-08-06  6:22 ` [PATCH 02/11] drm/xe: Implement xe_pagefault_init Matthew Brost
                   ` (11 subsequent siblings)
  12 siblings, 3 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

Stub out the new page fault layer and add kernel documentation. This is
intended as a replacement for the GT page fault layer, enabling multiple
producers to hook into a shared page fault consumer interface.
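
As a rough sketch of the intended contract (the function below is
illustrative only and not part of this patch; the field values are made
up): a producer fills a struct xe_pagefault on its stack, hands it to
xe_pagefault_handler(), and is later told the outcome through its
ack_fault() op.

  static void example_producer_fault(struct xe_device *xe, struct xe_gt *gt,
				     u64 addr, u32 asid, void *private,
				     const struct xe_pagefault_ops *ops)
  {
	struct xe_pagefault pf = {
		.gt = gt,
		.consumer.page_addr = addr,
		.consumer.asid = asid,
		.consumer.access_type = XE_PAGEFAULT_ACCESS_TYPE_WRITE,
		.consumer.fault_type = XE_PAGEFAULT_TYPE_NOT_PRESENT,
		.producer.ops = ops,	/* ->ack_fault() called once serviced */
		.producer.private = private,
	};

	/* The consumer copies the fault into its queue, so a stack copy is fine */
	xe_pagefault_handler(xe, &pf);
  }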

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/Makefile             |   1 +
 drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
 drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
 drivers/gpu/drm/xe/xe_pagefault_types.h | 125 ++++++++++++++++++++++++
 4 files changed, 208 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
 create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
 create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 8e0c3412a757..6fbebafe79c9 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -93,6 +93,7 @@ xe-y += xe_bb.o \
 	xe_nvm.o \
 	xe_oa.o \
 	xe_observation.o \
+	xe_pagefault.o \
 	xe_pat.o \
 	xe_pci.o \
 	xe_pcode.o \
diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
new file mode 100644
index 000000000000..3ce0e8d74b9d
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_pagefault.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include "xe_pagefault.h"
+#include "xe_pagefault_types.h"
+
+/**
+ * DOC: Xe page faults
+ *
+ * Xe page faults are handled in two layers. The producer layer interacts with
+ * hardware or firmware to receive and parse faults into struct xe_pagefault,
+ * then forwards them to the consumer. The consumer layer services the faults
+ * (e.g., memory migration, page table updates) and acknowledges the result back
+ * to the producer, which then forwards the results to the hardware or firmware.
+ * The consumer uses a page fault queue sized to absorb all potential faults and
+ * a multi-threaded worker to process them. Multiple producers are supported,
+ * with a single shared consumer.
+ */
+
+/**
+ * xe_pagefault_init() - Page fault init
+ * @xe: xe device instance
+ *
+ * Initialize Xe page fault state. Must be done after reading fuses.
+ *
+ * Return: 0 on Success, errno on failure
+ */
+int xe_pagefault_init(struct xe_device *xe)
+{
+	/* TODO - implement */
+	return 0;
+}
+
+/**
+ * xe_pagefault_reset() - Page fault reset for a GT
+ * @xe: xe device instance
+ * @gt: GT being reset
+ *
+ * Reset the Xe page fault state for a GT; that is, squash any pending faults on
+ * the GT.
+ */
+void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
+{
+	/* TODO - implement */
+}
+
+/**
+ * xe_pagefault_handler() - Page fault handler
+ * @xe: xe device instance
+ * @pf: Page fault
+ *
+ * Sink the page fault to a queue (i.e., a memory buffer) and queue a worker to
+ * service it. Safe to be called from IRQ or process context. Reclaim safe.
+ *
+ * Return: 0 on success, errno on failure
+ */
+int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
+{
+	/* TODO - implement */
+	return 0;
+}
diff --git a/drivers/gpu/drm/xe/xe_pagefault.h b/drivers/gpu/drm/xe/xe_pagefault.h
new file mode 100644
index 000000000000..bd0cdf9ed37f
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_pagefault.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef _XE_PAGEFAULT_H_
+#define _XE_PAGEFAULT_H_
+
+struct xe_device;
+struct xe_gt;
+struct xe_pagefault;
+
+int xe_pagefault_init(struct xe_device *xe);
+
+void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
+
+int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h b/drivers/gpu/drm/xe/xe_pagefault_types.h
new file mode 100644
index 000000000000..fcff84f93dd8
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef _XE_PAGEFAULT_TYPES_H_
+#define _XE_PAGEFAULT_TYPES_H_
+
+#include <linux/workqueue.h>
+
+struct xe_pagefault;
+struct xe_gt;
+
+/** enum xe_pagefault_access_type - Xe page fault access type */
+enum xe_pagefault_access_type {
+	/** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type */
+	XE_PAGEFAULT_ACCESS_TYPE_READ	= 0,
+	/** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access type */
+	XE_PAGEFAULT_ACCESS_TYPE_WRITE	= 1,
+	/** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access type */
+	XE_PAGEFAULT_ACCESS_TYPE_ATOMIC	= 2,
+};
+
+/** enum xe_pagefault_type - Xe page fault type */
+enum xe_pagefault_type {
+	/** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
+	XE_PAGEFAULT_TYPE_NOT_PRESENT		= 0,
+	/** @XE_PAGEFAULT_WRITE_ACCESS_VIOLATION: Write access violation */
+	XE_PAGEFAULT_WRITE_ACCESS_VIOLATION	= 1,
+	/** @XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION: Atomic access violation */
+	XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION	= 2,
+};
+
+/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
+struct xe_pagefault_ops {
+	/**
+	 * @ack_fault: Ack fault
+	 * @pf: Page fault
+	 * @err: Error state of fault
+	 *
+	 * Page fault producer receives acknowledgment from the consumer and
+	 * sends the result to the HW/FW interface.
+	 */
+	void (*ack_fault)(struct xe_pagefault *pf, int err);
+};
+
+/**
+ * struct xe_pagefault - Xe page fault
+ *
+ * Generic page fault structure for communication between producer and consumer.
+ * Carefully sized to be 64 bytes.
+ */
+struct xe_pagefault {
+	/**
+	 * @gt: GT of fault
+	 *
+	 * XXX: We may want to decouple the GT from individual faults, as it's
+	 * unclear whether future platforms will always have a GT for all page
+	 * fault producers. Internally, the GT is used for stats, identifying
+	 * the appropriate VRAM region, and locating the migration queue.
+	 * Leaving this as-is for now, but we can revisit later to see if we
+	 * can convert it to use the Xe device pointer instead.
+	 */
+	struct xe_gt *gt;
+	/**
+	 * @consumer: State for the software handling the fault. Populated by
+	 * the producer and may be modified by the consumer to communicate
+	 * information back to the producer upon fault acknowledgment.
+	 */
+	struct {
+		/** @consumer.page_addr: address of page fault */
+		u64 page_addr;
+		/** @consumer.asid: address space ID */
+		u32 asid;
+		/** @consumer.access_type: access type */
+		u8 access_type;
+		/** @consumer.fault_type: fault type */
+		u8 fault_type;
+#define XE_PAGEFAULT_LEVEL_NACK		0xff	/* Producer indicates nack fault */
+		/** @consumer.fault_level: fault level */
+		u8 fault_level;
+		/** @consumer.engine_class: engine class */
+		u8 engine_class;
+		/** @consumer.reserved: reserved bits for future expansion */
+		u64 reserved;
+	} consumer;
+	/**
+	 * @producer: State for the producer (i.e., HW/FW interface). Populated
+	 * by the producer and should not be modified, or even inspected, by the
+	 * consumer, except for calling operations.
+	 */
+	struct {
+		/** @producer.private: private pointer */
+		void *private;
+		/** @producer.ops: operations */
+		const struct xe_pagefault_ops *ops;
+#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW	4
+		/**
+		 * @producer.msg: page fault message, used by producer in fault
+		 * acknowledgement to formulate response to HW/FW interface.
+		 */
+		u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
+	} producer;
+};
+
+/** struct xe_pagefault_queue - Xe pagefault queue (consumer) */
+struct xe_pagefault_queue {
+	/**
+	 * @data: Data in queue containing struct xe_pagefault, protected by
+	 * @lock
+	 */
+	void *data;
+	/** @size: Size of queue in bytes */
+	u32 size;
+	/** @head: Head pointer in bytes, moved by producer, protected by @lock */
+	u32 head;
+	/** @tail: Tail pointer in bytes, moved by consumer, protected by @lock */
+	u32 tail;
+	/** @lock: protects page fault queue */
+	spinlock_t lock;
+	/** @worker: to process page faults */
+	struct work_struct worker;
+};
+
+#endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 02/11] drm/xe: Implement xe_pagefault_init
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
  2025-08-06  6:22 ` [PATCH 01/11] drm/xe: Stub out new pagefault layer Matthew Brost
@ 2025-08-06  6:22 ` Matthew Brost
  2025-08-06 23:08   ` Summers, Stuart
                     ` (2 more replies)
  2025-08-06  6:22 ` [PATCH 03/11] drm/xe: Implement xe_pagefault_reset Matthew Brost
                   ` (10 subsequent siblings)
  12 siblings, 3 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

Create pagefault queues and initialize them.
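
For a sense of scale (numbers purely illustrative): struct xe_pagefault
is sized to 64 bytes, so xe_pagefault_entry_size() returns 64. On a
hypothetical device with 1024 EUs summed over all GTs, the EU term alone
gives 1024 * 64 * 8 = 512 KiB; the XE_NUM_HW_ENGINES term pushes the
total past that, so roundup_pow_of_two() lands on 1 MiB for each of the
four queues.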

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_device.c       |  5 ++
 drivers/gpu/drm/xe/xe_device_types.h |  6 ++
 drivers/gpu/drm/xe/xe_pagefault.c    | 93 +++++++++++++++++++++++++++-
 3 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 57edbc63da6f..c7c8aee03841 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -50,6 +50,7 @@
 #include "xe_nvm.h"
 #include "xe_oa.h"
 #include "xe_observation.h"
+#include "xe_pagefault.h"
 #include "xe_pat.h"
 #include "xe_pcode.h"
 #include "xe_pm.h"
@@ -890,6 +891,10 @@ int xe_device_probe(struct xe_device *xe)
 	if (err)
 		return err;
 
+	err = xe_pagefault_init(xe);
+	if (err)
+		return err;
+
 	xe_nvm_init(xe);
 
 	err = xe_heci_gsc_init(xe);
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 01e8fa0d2f9f..6aa119026ce9 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -17,6 +17,7 @@
 #include "xe_lmtt_types.h"
 #include "xe_memirq_types.h"
 #include "xe_oa_types.h"
+#include "xe_pagefault_types.h"
 #include "xe_platform_types.h"
 #include "xe_pmu_types.h"
 #include "xe_pt_types.h"
@@ -394,6 +395,11 @@ struct xe_device {
 		u32 next_asid;
 		/** @usm.lock: protects UM state */
 		struct rw_semaphore lock;
+		/** @usm.pf_wq: page fault work queue, unbound, high priority */
+		struct workqueue_struct *pf_wq;
+#define XE_PAGEFAULT_QUEUE_COUNT	4
+		/** @usm.pf_queue: Page fault queues */
+		struct xe_pagefault_queue pf_queue[XE_PAGEFAULT_QUEUE_COUNT];
 	} usm;
 
 	/** @pinned: pinned BO state */
diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
index 3ce0e8d74b9d..14304c41eb23 100644
--- a/drivers/gpu/drm/xe/xe_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_pagefault.c
@@ -3,6 +3,10 @@
  * Copyright © 2025 Intel Corporation
  */
 
+#include <drm/drm_managed.h>
+
+#include "xe_device.h"
+#include "xe_gt_types.h"
 #include "xe_pagefault.h"
 #include "xe_pagefault_types.h"
 
@@ -19,6 +23,71 @@
  * with a single shared consumer.
  */
 
+static int xe_pagefault_entry_size(void)
+{
+	return roundup_pow_of_two(sizeof(struct xe_pagefault));
+}
+
+static void xe_pagefault_queue_work(struct work_struct *w)
+{
+	/* TODO: Implement */
+}
+
+static int xe_pagefault_queue_init(struct xe_device *xe,
+				   struct xe_pagefault_queue *pf_queue)
+{
+	struct xe_gt *gt;
+	int total_num_eus = 0;
+	u8 id;
+
+	for_each_gt(gt, xe, id) {
+		xe_dss_mask_t all_dss;
+		int num_dss, num_eus;
+
+		bitmap_or(all_dss, gt->fuse_topo.g_dss_mask,
+			  gt->fuse_topo.c_dss_mask, XE_MAX_DSS_FUSE_BITS);
+
+		num_dss = bitmap_weight(all_dss, XE_MAX_DSS_FUSE_BITS);
+		num_eus = bitmap_weight(gt->fuse_topo.eu_mask_per_dss,
+					XE_MAX_EU_FUSE_BITS) * num_dss;
+
+		total_num_eus += num_eus;
+	}
+
+	xe_assert(xe, total_num_eus);
+
+	/*
+	 * user can issue separate page faults per EU and per CS
+	 *
+	 * XXX: Multiplier required as compute UMDs are getting PF queue errors
+	 * without it. Follow up on why this multiplier is required.
+	 */
+#define PF_MULTIPLIER	8
+	pf_queue->size = (total_num_eus + XE_NUM_HW_ENGINES) *
+		xe_pagefault_entry_size() * PF_MULTIPLIER;
+	pf_queue->size = roundup_pow_of_two(pf_queue->size);
+#undef PF_MULTIPLIER
+
+	drm_dbg(&xe->drm, "xe_pagefault_entry_size=%d, total_num_eus=%d, pf_queue->size=%u",
+		xe_pagefault_entry_size(), total_num_eus, pf_queue->size);
+
+	pf_queue->data = devm_kzalloc(xe->drm.dev, pf_queue->size, GFP_KERNEL);
+	if (!pf_queue->data)
+		return -ENOMEM;
+
+	spin_lock_init(&pf_queue->lock);
+	INIT_WORK(&pf_queue->worker, xe_pagefault_queue_work);
+
+	return 0;
+}
+
+static void xe_pagefault_fini(void *arg)
+{
+	struct xe_device *xe = arg;
+
+	destroy_workqueue(xe->usm.pf_wq);
+}
+
 /**
  * xe_pagefault_init() - Page fault init
  * @xe: xe device instance
@@ -29,8 +98,28 @@
  */
 int xe_pagefault_init(struct xe_device *xe)
 {
-	/* TODO - implement */
-	return 0;
+	int err, i;
+
+	if (!xe->info.has_usm)
+		return 0;
+
+	xe->usm.pf_wq = alloc_workqueue("xe_page_fault_work_queue",
+					WQ_UNBOUND | WQ_HIGHPRI,
+					XE_PAGEFAULT_QUEUE_COUNT);
+	if (!xe->usm.pf_wq)
+		return -ENOMEM;
+
+	for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i) {
+		err = xe_pagefault_queue_init(xe, xe->usm.pf_queue + i);
+		if (err)
+			goto err_out;
+	}
+
+	return devm_add_action_or_reset(xe->drm.dev, xe_pagefault_fini, xe);
+
+err_out:
+	destroy_workqueue(xe->usm.pf_wq);
+	return err;
 }
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 03/11] drm/xe: Implement xe_pagefault_reset
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
  2025-08-06  6:22 ` [PATCH 01/11] drm/xe: Stub out new pagefault layer Matthew Brost
  2025-08-06  6:22 ` [PATCH 02/11] drm/xe: Implement xe_pagefault_init Matthew Brost
@ 2025-08-06  6:22 ` Matthew Brost
  2025-08-06 23:16   ` Summers, Stuart
  2025-08-06  6:22 ` [PATCH 04/11] drm/xe: Implement xe_pagefault_handler Matthew Brost
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

Squash any pending faults on the GT being reset by setting the GT field
in struct xe_pagefault to NULL.
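
The consumer side of the squash lands later in the series: when the
worker pops an entry whose GT pointer has been cleared here, it simply
skips it, roughly:

	while (xe_pagefault_queue_pop(pf_queue, &pf)) {
		if (!pf.gt)	/* fault squashed during reset */
			continue;
		...
	}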

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt.c        |  2 ++
 drivers/gpu/drm/xe/xe_pagefault.c | 23 ++++++++++++++++++++++-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index 390394bbaadc..5aa03f89a062 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -50,6 +50,7 @@
 #include "xe_map.h"
 #include "xe_migrate.h"
 #include "xe_mmio.h"
+#include "xe_pagefault.h"
 #include "xe_pat.h"
 #include "xe_pm.h"
 #include "xe_mocs.h"
@@ -846,6 +847,7 @@ static int gt_reset(struct xe_gt *gt)
 
 	xe_uc_gucrc_disable(&gt->uc);
 	xe_uc_stop_prepare(&gt->uc);
+	xe_pagefault_reset(gt_to_xe(gt), gt);
 	xe_gt_pagefault_reset(gt);
 
 	xe_uc_stop(&gt->uc);
diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
index 14304c41eb23..aef389e51612 100644
--- a/drivers/gpu/drm/xe/xe_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_pagefault.c
@@ -122,6 +122,24 @@ int xe_pagefault_init(struct xe_device *xe)
 	return err;
 }
 
+static void xe_pagefault_queue_reset(struct xe_device *xe, struct xe_gt *gt,
+				     struct xe_pagefault_queue *pf_queue)
+{
+	u32 i;
+
+	/* Squash all pending faults on the GT */
+
+	spin_lock_irq(&pf_queue->lock);
+	for (i = pf_queue->tail; i != pf_queue->head;
+	     i = (i + xe_pagefault_entry_size()) % pf_queue->size) {
+		struct xe_pagefault *pf = pf_queue->data + i;
+
+		if (pf->gt == gt)
+			pf->gt = NULL;
+	}
+	spin_unlock_irq(&pf_queue->lock);
+}
+
 /**
  * xe_pagefault_reset() - Page fault reset for a GT
  * @xe: xe device instance
@@ -132,7 +150,10 @@ int xe_pagefault_init(struct xe_device *xe)
  */
 void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
 {
-	/* TODO - implement */
+	int i;
+
+	for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i)
+		xe_pagefault_queue_reset(xe, gt, xe->usm.pf_queue + i);
 }
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 04/11] drm/xe: Implement xe_pagefault_handler
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
                   ` (2 preceding siblings ...)
  2025-08-06  6:22 ` [PATCH 03/11] drm/xe: Implement xe_pagefault_reset Matthew Brost
@ 2025-08-06  6:22 ` Matthew Brost
  2025-08-28 11:26   ` Francois Dugast
  2025-08-28 20:24   ` Summers, Stuart
  2025-08-06  6:22 ` [PATCH 05/11] drm/xe: Implement xe_pagefault_queue_work Matthew Brost
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

Enqueue (copy) the input struct xe_pagefault into a queue (i.e., into a
memory buffer) and schedule a worker to service it.
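
A worked example of the ring arithmetic (sizes assumed for illustration):
with a 512 KiB queue and 64-byte entries there are 8192 slots, the head
advances as (head + 64) % 524288 on every enqueue, and
xe_pagefault_queue_full() reports full once 8191 entries are pending,
since the CIRC_SPACE() convention always keeps at least one entry-sized
slot free.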

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_pagefault.c | 32 +++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
index aef389e51612..98be3203a9df 100644
--- a/drivers/gpu/drm/xe/xe_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_pagefault.c
@@ -3,6 +3,8 @@
  * Copyright © 2025 Intel Corporation
  */
 
+#include <linux/circ_buf.h>
+
 #include <drm/drm_managed.h>
 
 #include "xe_device.h"
@@ -156,6 +158,14 @@ void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
 		xe_pagefault_queue_reset(xe, gt, xe->usm.pf_queue + i);
 }
 
+static bool xe_pagefault_queue_full(struct xe_pagefault_queue *pf_queue)
+{
+	lockdep_assert_held(&pf_queue->lock);
+
+	return CIRC_SPACE(pf_queue->head, pf_queue->tail, pf_queue->size) <=
+		xe_pagefault_entry_size();
+}
+
 /**
  * xe_pagefault_handler() - Page fault handler
  * @xe: xe device instance
@@ -168,6 +178,24 @@ void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
  */
 int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
 {
-	/* TODO - implement */
-	return 0;
+	struct xe_pagefault_queue *pf_queue = xe->usm.pf_queue +
+		(pf->consumer.asid % XE_PAGEFAULT_QUEUE_COUNT);
+	unsigned long flags;
+	bool full;
+
+	spin_lock_irqsave(&pf_queue->lock, flags);
+	full = xe_pagefault_queue_full(pf_queue);
+	if (!full) {
+		memcpy(pf_queue->data + pf_queue->head, pf, sizeof(*pf));
+		pf_queue->head = (pf_queue->head + xe_pagefault_entry_size()) %
+			pf_queue->size;
+		queue_work(xe->usm.pf_wq, &pf_queue->worker);
+	} else {
+		drm_warn(&xe->drm,
+			 "PageFault Queue (%d) full, shouldn't be possible\n",
+			 pf->consumer.asid % XE_PAGEFAULT_QUEUE_COUNT);
+	}
+	spin_unlock_irqrestore(&pf_queue->lock, flags);
+
+	return full ? -ENOSPC : 0;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 05/11] drm/xe: Implement xe_pagefault_queue_work
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
                   ` (3 preceding siblings ...)
  2025-08-06  6:22 ` [PATCH 04/11] drm/xe: Implement xe_pagefault_handler Matthew Brost
@ 2025-08-06  6:22 ` Matthew Brost
  2025-08-28 12:29   ` Francois Dugast
  2025-08-28 22:04   ` Summers, Stuart
  2025-08-06  6:22 ` [PATCH 06/11] drm/xe: Add xe_guc_pagefault layer Matthew Brost
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

Implement a worker that services page faults, using the same
implementation as in xe_gt_pagefault.c.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_pagefault.c | 240 +++++++++++++++++++++++++++++-
 1 file changed, 239 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
index 98be3203a9df..474412c21ec3 100644
--- a/drivers/gpu/drm/xe/xe_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_pagefault.c
@@ -5,12 +5,20 @@
 
 #include <linux/circ_buf.h>
 
+#include <drm/drm_exec.h>
 #include <drm/drm_managed.h>
 
+#include "xe_bo.h"
 #include "xe_device.h"
+#include "xe_gt_printk.h"
 #include "xe_gt_types.h"
+#include "xe_gt_stats.h"
+#include "xe_hw_engine.h"
 #include "xe_pagefault.h"
 #include "xe_pagefault_types.h"
+#include "xe_svm.h"
+#include "xe_trace_bo.h"
+#include "xe_vm.h"
 
 /**
  * DOC: Xe page faults
@@ -30,9 +38,239 @@ static int xe_pagefault_entry_size(void)
 	return roundup_pow_of_two(sizeof(struct xe_pagefault));
 }
 
+static int xe_pagefault_begin(struct drm_exec *exec, struct xe_vma *vma,
+			      bool atomic, unsigned int id)
+{
+	struct xe_bo *bo = xe_vma_bo(vma);
+	struct xe_vm *vm = xe_vma_vm(vma);
+	int err;
+
+	err = xe_vm_lock_vma(exec, vma);
+	if (err)
+		return err;
+
+	if (atomic && IS_DGFX(vm->xe)) {
+		if (xe_vma_is_userptr(vma)) {
+			err = -EACCES;
+			return err;
+		}
+
+		/* Migrate to VRAM, move should invalidate the VMA first */
+		err = xe_bo_migrate(bo, XE_PL_VRAM0 + id);
+		if (err)
+			return err;
+	} else if (bo) {
+		/* Create backing store if needed */
+		err = xe_bo_validate(bo, vm, true);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
+				   bool atomic)
+{
+	struct xe_vm *vm = xe_vma_vm(vma);
+	struct xe_tile *tile = gt_to_tile(gt);
+	struct drm_exec exec;
+	struct dma_fence *fence;
+	ktime_t end = 0;
+	int err;
+
+	lockdep_assert_held_write(&vm->lock);
+
+	xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_COUNT, 1);
+	xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_KB,
+			 xe_vma_size(vma) / SZ_1K);
+
+	trace_xe_vma_pagefault(vma);
+
+	/* Check if VMA is valid, opportunistic check only */
+	if (xe_vm_has_valid_gpu_mapping(tile, vma->tile_present,
+					vma->tile_invalidated) && !atomic)
+		return 0;
+
+retry_userptr:
+	if (xe_vma_is_userptr(vma) &&
+	    xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
+		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
+
+		err = xe_vma_userptr_pin_pages(uvma);
+		if (err)
+			return err;
+	}
+
+	/* Lock VM and BOs dma-resv */
+	drm_exec_init(&exec, 0, 0);
+	drm_exec_until_all_locked(&exec) {
+		err = xe_pagefault_begin(&exec, vma, atomic, tile->id);
+		drm_exec_retry_on_contention(&exec);
+		if (xe_vm_validate_should_retry(&exec, err, &end))
+			err = -EAGAIN;
+		if (err)
+			goto unlock_dma_resv;
+
+		/* Bind VMA only to the GT that has faulted */
+		trace_xe_vma_pf_bind(vma);
+		fence = xe_vma_rebind(vm, vma, BIT(tile->id));
+		if (IS_ERR(fence)) {
+			err = PTR_ERR(fence);
+			if (xe_vm_validate_should_retry(&exec, err, &end))
+				err = -EAGAIN;
+			goto unlock_dma_resv;
+		}
+	}
+
+	dma_fence_wait(fence, false);
+	dma_fence_put(fence);
+
+unlock_dma_resv:
+	drm_exec_fini(&exec);
+	if (err == -EAGAIN)
+		goto retry_userptr;
+
+	return err;
+}
+
+static bool
+xe_pagefault_access_is_atomic(enum xe_pagefault_access_type access_type)
+{
+	return access_type == XE_PAGEFAULT_ACCESS_TYPE_ATOMIC;
+}
+
+static struct xe_vm *xe_pagefault_asid_to_vm(struct xe_device *xe, u32 asid)
+{
+	struct xe_vm *vm;
+
+	down_read(&xe->usm.lock);
+	vm = xa_load(&xe->usm.asid_to_vm, asid);
+	if (vm && xe_vm_in_fault_mode(vm))
+		xe_vm_get(vm);
+	else
+		vm = ERR_PTR(-EINVAL);
+	up_read(&xe->usm.lock);
+
+	return vm;
+}
+
+static int xe_pagefault_service(struct xe_pagefault *pf)
+{
+	struct xe_gt *gt = pf->gt;
+	struct xe_device *xe = gt_to_xe(gt);
+	struct xe_vm *vm;
+	struct xe_vma *vma = NULL;
+	int err;
+	bool atomic;
+
+	/* Producer flagged this fault to be nacked */
+	if (pf->consumer.fault_level == XE_PAGEFAULT_LEVEL_NACK)
+		return -EFAULT;
+
+	vm = xe_pagefault_asid_to_vm(xe, pf->consumer.asid);
+	if (IS_ERR(vm))
+		return PTR_ERR(vm);
+
+	/*
+	 * TODO: Change to read lock? Using write lock for simplicity.
+	 */
+	down_write(&vm->lock);
+
+	if (xe_vm_is_closed(vm)) {
+		err = -ENOENT;
+		goto unlock_vm;
+	}
+
+	vma = xe_vm_find_vma_by_addr(vm, pf->consumer.page_addr);
+	if (!vma) {
+		err = -EINVAL;
+		goto unlock_vm;
+	}
+
+	atomic = xe_pagefault_access_is_atomic(pf->consumer.access_type);
+
+	if (xe_vma_is_cpu_addr_mirror(vma))
+		err = xe_svm_handle_pagefault(vm, vma, gt,
+					      pf->consumer.page_addr, atomic);
+	else
+		err = xe_pagefault_handle_vma(gt, vma, atomic);
+
+unlock_vm:
+	if (!err)
+		vm->usm.last_fault_vma = vma;
+	up_write(&vm->lock);
+	xe_vm_put(vm);
+
+	return err;
+}
+
+static bool xe_pagefault_queue_pop(struct xe_pagefault_queue *pf_queue,
+				   struct xe_pagefault *pf)
+{
+	bool found_fault = false;
+
+	spin_lock_irq(&pf_queue->lock);
+	if (pf_queue->tail != pf_queue->head) {
+		memcpy(pf, pf_queue->data + pf_queue->tail, sizeof(*pf));
+		pf_queue->tail = (pf_queue->tail + xe_pagefault_entry_size()) %
+			pf_queue->size;
+		found_fault = true;
+	}
+	spin_unlock_irq(&pf_queue->lock);
+
+	return found_fault;
+}
+
+static void xe_pagefault_print(struct xe_pagefault *pf)
+{
+	xe_gt_dbg(pf->gt, "\n\tASID: %d\n"
+		  "\tFaulted Address: 0x%08x%08x\n"
+		  "\tFaultType: %d\n"
+		  "\tAccessType: %d\n"
+		  "\tFaultLevel: %d\n"
+		  "\tEngineClass: %d %s\n",
+		  pf->consumer.asid,
+		  upper_32_bits(pf->consumer.page_addr),
+		  lower_32_bits(pf->consumer.page_addr),
+		  pf->consumer.fault_type,
+		  pf->consumer.access_type,
+		  pf->consumer.fault_level,
+		  pf->consumer.engine_class,
+		  xe_hw_engine_class_to_str(pf->consumer.engine_class));
+}
+
 static void xe_pagefault_queue_work(struct work_struct *w)
 {
-	/* TODO: Implement */
+	struct xe_pagefault_queue *pf_queue =
+		container_of(w, typeof(*pf_queue), worker);
+	struct xe_pagefault pf;
+	unsigned long threshold;
+
+#define USM_QUEUE_MAX_RUNTIME_MS      20
+	threshold = jiffies + msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
+
+	while (xe_pagefault_queue_pop(pf_queue, &pf)) {
+		int err;
+
+		if (!pf.gt)	/* Fault squashed during reset */
+			continue;
+
+		err = xe_pagefault_service(&pf);
+		if (err) {
+			xe_pagefault_print(&pf);
+			xe_gt_dbg(pf.gt, "Fault response: Unsuccessful %pe\n",
+				  ERR_PTR(err));
+		}
+
+		pf.producer.ops->ack_fault(&pf, err);
+
+		if (time_after(jiffies, threshold)) {
+			queue_work(gt_to_xe(pf.gt)->usm.pf_wq, w);
+			break;
+		}
+	}
+#undef USM_QUEUE_MAX_RUNTIME_MS
 }
 
 static int xe_pagefault_queue_init(struct xe_device *xe,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 06/11] drm/xe: Add xe_guc_pagefault layer
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
                   ` (4 preceding siblings ...)
  2025-08-06  6:22 ` [PATCH 05/11] drm/xe: Implement xe_pagefault_queue_work Matthew Brost
@ 2025-08-06  6:22 ` Matthew Brost
  2025-08-28 13:27   ` Francois Dugast
  2025-08-28 22:11   ` Summers, Stuart
  2025-08-06  6:22 ` [PATCH 07/11] drm/xe: Remove unused GT page fault code Matthew Brost
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

Add the xe_guc_pagefault layer (producer), which parses G2H fault
messages into struct xe_pagefault, forwards them to the page fault layer
(consumer) for servicing, and provides a vfunc to acknowledge faults to
the GuC upon completion. Replace the old (and incorrect) GT page fault
layer with this new layer throughout the driver.
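
For context, the new handler keeps the name and signature of the old
one, so the existing G2H dispatch in xe_guc_ct.c continues to feed it;
from memory of that (unchanged) code, the hook is roughly:

	case XE_GUC_ACTION_PAGE_FAULT_NOTIFY:
		ret = xe_guc_pagefault_handler(guc, payload, adj_len);
		break;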

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/Makefile           |  2 +-
 drivers/gpu/drm/xe/xe_gt.c            |  6 --
 drivers/gpu/drm/xe/xe_guc_ct.c        |  6 +-
 drivers/gpu/drm/xe/xe_guc_pagefault.c | 94 +++++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_guc_pagefault.h | 13 ++++
 drivers/gpu/drm/xe/xe_svm.c           |  3 +-
 drivers/gpu/drm/xe/xe_vm.c            |  1 -
 7 files changed, 110 insertions(+), 15 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.c
 create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.h

diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
index 6fbebafe79c9..c103c114b75c 100644
--- a/drivers/gpu/drm/xe/Makefile
+++ b/drivers/gpu/drm/xe/Makefile
@@ -58,7 +58,6 @@ xe-y += xe_bb.o \
 	xe_gt_freq.o \
 	xe_gt_idle.o \
 	xe_gt_mcr.o \
-	xe_gt_pagefault.o \
 	xe_gt_sysfs.o \
 	xe_gt_throttle.o \
 	xe_gt_tlb_invalidation.o \
@@ -75,6 +74,7 @@ xe-y += xe_bb.o \
 	xe_guc_id_mgr.o \
 	xe_guc_klv_helpers.o \
 	xe_guc_log.o \
+	xe_guc_pagefault.o \
 	xe_guc_pc.o \
 	xe_guc_submit.o \
 	xe_heci_gsc.o \
diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
index 5aa03f89a062..35c7ba7828a6 100644
--- a/drivers/gpu/drm/xe/xe_gt.c
+++ b/drivers/gpu/drm/xe/xe_gt.c
@@ -32,7 +32,6 @@
 #include "xe_gt_freq.h"
 #include "xe_gt_idle.h"
 #include "xe_gt_mcr.h"
-#include "xe_gt_pagefault.h"
 #include "xe_gt_printk.h"
 #include "xe_gt_sriov_pf.h"
 #include "xe_gt_sriov_vf.h"
@@ -634,10 +633,6 @@ int xe_gt_init(struct xe_gt *gt)
 	if (err)
 		return err;
 
-	err = xe_gt_pagefault_init(gt);
-	if (err)
-		return err;
-
 	err = xe_gt_idle_init(&gt->gtidle);
 	if (err)
 		return err;
@@ -848,7 +843,6 @@ static int gt_reset(struct xe_gt *gt)
 	xe_uc_gucrc_disable(&gt->uc);
 	xe_uc_stop_prepare(&gt->uc);
 	xe_pagefault_reset(gt_to_xe(gt), gt);
-	xe_gt_pagefault_reset(gt);
 
 	xe_uc_stop(&gt->uc);
 
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 3f4e6a46ff16..67b5dd182207 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -21,7 +21,6 @@
 #include "xe_devcoredump.h"
 #include "xe_device.h"
 #include "xe_gt.h"
-#include "xe_gt_pagefault.h"
 #include "xe_gt_printk.h"
 #include "xe_gt_sriov_pf_control.h"
 #include "xe_gt_sriov_pf_monitor.h"
@@ -29,6 +28,7 @@
 #include "xe_gt_tlb_invalidation.h"
 #include "xe_guc.h"
 #include "xe_guc_log.h"
+#include "xe_guc_pagefault.h"
 #include "xe_guc_relay.h"
 #include "xe_guc_submit.h"
 #include "xe_map.h"
@@ -1419,10 +1419,6 @@ static int process_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
 		ret = xe_guc_tlb_invalidation_done_handler(guc, payload,
 							   adj_len);
 		break;
-	case XE_GUC_ACTION_ACCESS_COUNTER_NOTIFY:
-		ret = xe_guc_access_counter_notify_handler(guc, payload,
-							   adj_len);
-		break;
 	case XE_GUC_ACTION_GUC2PF_RELAY_FROM_VF:
 		ret = xe_guc_relay_process_guc2pf(&guc->relay, hxg, hxg_len);
 		break;
diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.c b/drivers/gpu/drm/xe/xe_guc_pagefault.c
new file mode 100644
index 000000000000..0aa069d2a581
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_guc_pagefault.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#include "abi/guc_actions_abi.h"
+#include "xe_guc.h"
+#include "xe_guc_ct.h"
+#include "xe_guc_pagefault.h"
+#include "xe_pagefault.h"
+
+static void guc_ack_fault(struct xe_pagefault *pf, int err)
+{
+	u32 vfid = FIELD_GET(PFD_VFID, pf->producer.msg[2]);
+	u32 engine_instance = FIELD_GET(PFD_ENG_INSTANCE, pf->producer.msg[0]);
+	u32 engine_class = FIELD_GET(PFD_ENG_CLASS, pf->producer.msg[0]);
+	u32 pdata = FIELD_GET(PFD_PDATA_LO, pf->producer.msg[0]) |
+		(FIELD_GET(PFD_PDATA_HI, pf->producer.msg[1]) <<
+		 PFD_PDATA_HI_SHIFT);
+	u32 action[] = {
+		XE_GUC_ACTION_PAGE_FAULT_RES_DESC,
+
+		FIELD_PREP(PFR_VALID, 1) |
+		FIELD_PREP(PFR_SUCCESS, !!err) |
+		FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
+		FIELD_PREP(PFR_DESC_TYPE, FAULT_RESPONSE_DESC) |
+		FIELD_PREP(PFR_ASID, pf->consumer.asid),
+
+		FIELD_PREP(PFR_VFID, vfid) |
+		FIELD_PREP(PFR_ENG_INSTANCE, engine_instance) |
+		FIELD_PREP(PFR_ENG_CLASS, engine_class) |
+		FIELD_PREP(PFR_PDATA, pdata),
+	};
+	struct xe_guc *guc = pf->producer.private;
+
+	xe_guc_ct_send(&guc->ct, action, ARRAY_SIZE(action), 0, 0);
+}
+
+static const struct xe_pagefault_ops guc_pagefault_ops = {
+	.ack_fault = guc_ack_fault,
+};
+
+/**
+ * xe_guc_pagefault_handler() - G2H page fault handler
+ * @guc: GuC object
+ * @msg: G2H message
+ * @len: Length of G2H message
+ *
+ * Parse GuC to host (G2H) message into a struct xe_pagefault and forward onto
+ * the Xe page fault layer.
+ *
+ * Return: 0 on success, errno on failure
+ */
+int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len)
+{
+	struct xe_pagefault pf;
+	int i;
+
+#define GUC_PF_MSG_LEN_DW	\
+	(sizeof(struct xe_guc_pagefault_desc) / sizeof(u32))
+
+	BUILD_BUG_ON(GUC_PF_MSG_LEN_DW > XE_PAGEFAULT_PRODUCER_MSG_LEN_DW);
+
+	if (len != GUC_PF_MSG_LEN_DW)
+		return -EPROTO;
+
+	pf.gt = guc_to_gt(guc);
+
+	/*
+	 * XXX: These values happen to match the enum in xe_pagefault_types.h.
+	 * If that changes, we’ll need to remap them here.
+	 */
+	pf.consumer.page_addr = ((u64)FIELD_GET(PFD_VIRTUAL_ADDR_HI, msg[3]) <<
+				 PFD_VIRTUAL_ADDR_HI_SHIFT) |
+		(FIELD_GET(PFD_VIRTUAL_ADDR_LO, msg[2]) <<
+		 PFD_VIRTUAL_ADDR_LO_SHIFT);
+	pf.consumer.asid = FIELD_GET(PFD_ASID, msg[1]);
+	pf.consumer.access_type = FIELD_GET(PFD_ACCESS_TYPE, msg[2]);
+	pf.consumer.fault_type = FIELD_GET(PFD_FAULT_TYPE, msg[2]);
+	if (FIELD_GET(XE2_PFD_TRVA_FAULT, msg[0]))
+		pf.consumer.fault_level = XE_PAGEFAULT_LEVEL_NACK;
+	else
+		pf.consumer.fault_level = FIELD_GET(PFD_FAULT_LEVEL, msg[0]);
+	pf.consumer.engine_class = FIELD_GET(PFD_ENG_CLASS, msg[0]);
+
+	pf.producer.private = guc;
+	pf.producer.ops = &guc_pagefault_ops;
+	for (i = 0; i < GUC_PF_MSG_LEN_DW; ++i)
+		pf.producer.msg[i] = msg[i];
+
+#undef GUC_PF_MSG_LEN_DW
+
+	return xe_pagefault_handler(guc_to_xe(guc), &pf);
+}
diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.h b/drivers/gpu/drm/xe/xe_guc_pagefault.h
new file mode 100644
index 000000000000..0723f57b8ea9
--- /dev/null
+++ b/drivers/gpu/drm/xe/xe_guc_pagefault.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2025 Intel Corporation
+ */
+
+#ifndef _XE_GUC_PAGEFAULT_H_
+#define _XE_GUC_PAGEFAULT_H_
+
+#include <linux/types.h>
+
+int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len);
+
+#endif
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 10c8a1bcb86e..1bcf3ba3b350 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -109,8 +109,7 @@ xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct xe_svm_range *range,
 			      &vm->svm.garbage_collector.range_list);
 	spin_unlock(&vm->svm.garbage_collector.lock);
 
-	queue_work(xe_device_get_root_tile(xe)->primary_gt->usm.pf_wq,
-		   &vm->svm.garbage_collector.work);
+	queue_work(xe->usm.pf_wq, &vm->svm.garbage_collector.work);
 }
 
 static u8
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 432ea325677d..c9ae13c32117 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -27,7 +27,6 @@
 #include "xe_device.h"
 #include "xe_drm_client.h"
 #include "xe_exec_queue.h"
-#include "xe_gt_pagefault.h"
 #include "xe_gt_tlb_invalidation.h"
 #include "xe_migrate.h"
 #include "xe_pat.h"
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 07/11] drm/xe: Remove unused GT page fault code
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
                   ` (5 preceding siblings ...)
  2025-08-06  6:22 ` [PATCH 06/11] drm/xe: Add xe_guc_pagefault layer Matthew Brost
@ 2025-08-06  6:22 ` Matthew Brost
  2025-08-28 19:13   ` Summers, Stuart
  2025-08-06  6:22 ` [PATCH 08/11] drm/xe: Fine grained page fault locking Matthew Brost
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

With the Xe page fault layer and the GuC page fault layer in place, the
GT page fault code is now unused and can be removed. The access counter
(ACC) code is removed as well; it was already dead code.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c | 691 ---------------------------
 drivers/gpu/drm/xe/xe_gt_pagefault.h |  19 -
 drivers/gpu/drm/xe/xe_gt_types.h     |  65 ---
 3 files changed, 775 deletions(-)
 delete mode 100644 drivers/gpu/drm/xe/xe_gt_pagefault.c
 delete mode 100644 drivers/gpu/drm/xe/xe_gt_pagefault.h

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
deleted file mode 100644
index ab43dec52776..000000000000
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ /dev/null
@@ -1,691 +0,0 @@
-// SPDX-License-Identifier: MIT
-/*
- * Copyright © 2022 Intel Corporation
- */
-
-#include "xe_gt_pagefault.h"
-
-#include <linux/bitfield.h>
-#include <linux/circ_buf.h>
-
-#include <drm/drm_exec.h>
-#include <drm/drm_managed.h>
-
-#include "abi/guc_actions_abi.h"
-#include "xe_bo.h"
-#include "xe_gt.h"
-#include "xe_gt_printk.h"
-#include "xe_gt_stats.h"
-#include "xe_gt_tlb_invalidation.h"
-#include "xe_guc.h"
-#include "xe_guc_ct.h"
-#include "xe_migrate.h"
-#include "xe_svm.h"
-#include "xe_trace_bo.h"
-#include "xe_vm.h"
-#include "xe_vram_types.h"
-
-struct pagefault {
-	u64 page_addr;
-	u32 asid;
-	u16 pdata;
-	u8 vfid;
-	u8 access_type;
-	u8 fault_type;
-	u8 fault_level;
-	u8 engine_class;
-	u8 engine_instance;
-	u8 fault_unsuccessful;
-	bool trva_fault;
-};
-
-enum access_type {
-	ACCESS_TYPE_READ = 0,
-	ACCESS_TYPE_WRITE = 1,
-	ACCESS_TYPE_ATOMIC = 2,
-	ACCESS_TYPE_RESERVED = 3,
-};
-
-enum fault_type {
-	NOT_PRESENT = 0,
-	WRITE_ACCESS_VIOLATION = 1,
-	ATOMIC_ACCESS_VIOLATION = 2,
-};
-
-struct acc {
-	u64 va_range_base;
-	u32 asid;
-	u32 sub_granularity;
-	u8 granularity;
-	u8 vfid;
-	u8 access_type;
-	u8 engine_class;
-	u8 engine_instance;
-};
-
-static bool access_is_atomic(enum access_type access_type)
-{
-	return access_type == ACCESS_TYPE_ATOMIC;
-}
-
-static bool vma_is_valid(struct xe_tile *tile, struct xe_vma *vma)
-{
-	return xe_vm_has_valid_gpu_mapping(tile, vma->tile_present,
-					   vma->tile_invalidated);
-}
-
-static int xe_pf_begin(struct drm_exec *exec, struct xe_vma *vma,
-		       bool atomic, struct xe_vram_region *vram)
-{
-	struct xe_bo *bo = xe_vma_bo(vma);
-	struct xe_vm *vm = xe_vma_vm(vma);
-	int err;
-
-	err = xe_vm_lock_vma(exec, vma);
-	if (err)
-		return err;
-
-	if (atomic && vram) {
-		xe_assert(vm->xe, IS_DGFX(vm->xe));
-
-		if (xe_vma_is_userptr(vma)) {
-			err = -EACCES;
-			return err;
-		}
-
-		/* Migrate to VRAM, move should invalidate the VMA first */
-		err = xe_bo_migrate(bo, vram->placement);
-		if (err)
-			return err;
-	} else if (bo) {
-		/* Create backing store if needed */
-		err = xe_bo_validate(bo, vm, true);
-		if (err)
-			return err;
-	}
-
-	return 0;
-}
-
-static int handle_vma_pagefault(struct xe_gt *gt, struct xe_vma *vma,
-				bool atomic)
-{
-	struct xe_vm *vm = xe_vma_vm(vma);
-	struct xe_tile *tile = gt_to_tile(gt);
-	struct drm_exec exec;
-	struct dma_fence *fence;
-	ktime_t end = 0;
-	int err;
-
-	lockdep_assert_held_write(&vm->lock);
-
-	xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_COUNT, 1);
-	xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_KB, xe_vma_size(vma) / 1024);
-
-	trace_xe_vma_pagefault(vma);
-
-	/* Check if VMA is valid, opportunistic check only */
-	if (vma_is_valid(tile, vma) && !atomic)
-		return 0;
-
-retry_userptr:
-	if (xe_vma_is_userptr(vma) &&
-	    xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
-		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
-
-		err = xe_vma_userptr_pin_pages(uvma);
-		if (err)
-			return err;
-	}
-
-	/* Lock VM and BOs dma-resv */
-	drm_exec_init(&exec, 0, 0);
-	drm_exec_until_all_locked(&exec) {
-		err = xe_pf_begin(&exec, vma, atomic, tile->mem.vram);
-		drm_exec_retry_on_contention(&exec);
-		if (xe_vm_validate_should_retry(&exec, err, &end))
-			err = -EAGAIN;
-		if (err)
-			goto unlock_dma_resv;
-
-		/* Bind VMA only to the GT that has faulted */
-		trace_xe_vma_pf_bind(vma);
-		fence = xe_vma_rebind(vm, vma, BIT(tile->id));
-		if (IS_ERR(fence)) {
-			err = PTR_ERR(fence);
-			if (xe_vm_validate_should_retry(&exec, err, &end))
-				err = -EAGAIN;
-			goto unlock_dma_resv;
-		}
-	}
-
-	dma_fence_wait(fence, false);
-	dma_fence_put(fence);
-
-unlock_dma_resv:
-	drm_exec_fini(&exec);
-	if (err == -EAGAIN)
-		goto retry_userptr;
-
-	return err;
-}
-
-static struct xe_vm *asid_to_vm(struct xe_device *xe, u32 asid)
-{
-	struct xe_vm *vm;
-
-	down_read(&xe->usm.lock);
-	vm = xa_load(&xe->usm.asid_to_vm, asid);
-	if (vm && xe_vm_in_fault_mode(vm))
-		xe_vm_get(vm);
-	else
-		vm = ERR_PTR(-EINVAL);
-	up_read(&xe->usm.lock);
-
-	return vm;
-}
-
-static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
-{
-	struct xe_device *xe = gt_to_xe(gt);
-	struct xe_vm *vm;
-	struct xe_vma *vma = NULL;
-	int err;
-	bool atomic;
-
-	/* SW isn't expected to handle TRTT faults */
-	if (pf->trva_fault)
-		return -EFAULT;
-
-	vm = asid_to_vm(xe, pf->asid);
-	if (IS_ERR(vm))
-		return PTR_ERR(vm);
-
-	/*
-	 * TODO: Change to read lock? Using write lock for simplicity.
-	 */
-	down_write(&vm->lock);
-
-	if (xe_vm_is_closed(vm)) {
-		err = -ENOENT;
-		goto unlock_vm;
-	}
-
-	vma = xe_vm_find_vma_by_addr(vm, pf->page_addr);
-	if (!vma) {
-		err = -EINVAL;
-		goto unlock_vm;
-	}
-
-	atomic = access_is_atomic(pf->access_type);
-
-	if (xe_vma_is_cpu_addr_mirror(vma))
-		err = xe_svm_handle_pagefault(vm, vma, gt,
-					      pf->page_addr, atomic);
-	else
-		err = handle_vma_pagefault(gt, vma, atomic);
-
-unlock_vm:
-	if (!err)
-		vm->usm.last_fault_vma = vma;
-	up_write(&vm->lock);
-	xe_vm_put(vm);
-
-	return err;
-}
-
-static int send_pagefault_reply(struct xe_guc *guc,
-				struct xe_guc_pagefault_reply *reply)
-{
-	u32 action[] = {
-		XE_GUC_ACTION_PAGE_FAULT_RES_DESC,
-		reply->dw0,
-		reply->dw1,
-	};
-
-	return xe_guc_ct_send(&guc->ct, action, ARRAY_SIZE(action), 0, 0);
-}
-
-static void print_pagefault(struct xe_gt *gt, struct pagefault *pf)
-{
-	xe_gt_dbg(gt, "\n\tASID: %d\n"
-		  "\tVFID: %d\n"
-		  "\tPDATA: 0x%04x\n"
-		  "\tFaulted Address: 0x%08x%08x\n"
-		  "\tFaultType: %d\n"
-		  "\tAccessType: %d\n"
-		  "\tFaultLevel: %d\n"
-		  "\tEngineClass: %d %s\n"
-		  "\tEngineInstance: %d\n",
-		  pf->asid, pf->vfid, pf->pdata, upper_32_bits(pf->page_addr),
-		  lower_32_bits(pf->page_addr),
-		  pf->fault_type, pf->access_type, pf->fault_level,
-		  pf->engine_class, xe_hw_engine_class_to_str(pf->engine_class),
-		  pf->engine_instance);
-}
-
-#define PF_MSG_LEN_DW	4
-
-static bool get_pagefault(struct pf_queue *pf_queue, struct pagefault *pf)
-{
-	const struct xe_guc_pagefault_desc *desc;
-	bool ret = false;
-
-	spin_lock_irq(&pf_queue->lock);
-	if (pf_queue->tail != pf_queue->head) {
-		desc = (const struct xe_guc_pagefault_desc *)
-			(pf_queue->data + pf_queue->tail);
-
-		pf->fault_level = FIELD_GET(PFD_FAULT_LEVEL, desc->dw0);
-		pf->trva_fault = FIELD_GET(XE2_PFD_TRVA_FAULT, desc->dw0);
-		pf->engine_class = FIELD_GET(PFD_ENG_CLASS, desc->dw0);
-		pf->engine_instance = FIELD_GET(PFD_ENG_INSTANCE, desc->dw0);
-		pf->pdata = FIELD_GET(PFD_PDATA_HI, desc->dw1) <<
-			PFD_PDATA_HI_SHIFT;
-		pf->pdata |= FIELD_GET(PFD_PDATA_LO, desc->dw0);
-		pf->asid = FIELD_GET(PFD_ASID, desc->dw1);
-		pf->vfid = FIELD_GET(PFD_VFID, desc->dw2);
-		pf->access_type = FIELD_GET(PFD_ACCESS_TYPE, desc->dw2);
-		pf->fault_type = FIELD_GET(PFD_FAULT_TYPE, desc->dw2);
-		pf->page_addr = (u64)(FIELD_GET(PFD_VIRTUAL_ADDR_HI, desc->dw3)) <<
-			PFD_VIRTUAL_ADDR_HI_SHIFT;
-		pf->page_addr |= FIELD_GET(PFD_VIRTUAL_ADDR_LO, desc->dw2) <<
-			PFD_VIRTUAL_ADDR_LO_SHIFT;
-
-		pf_queue->tail = (pf_queue->tail + PF_MSG_LEN_DW) %
-			pf_queue->num_dw;
-		ret = true;
-	}
-	spin_unlock_irq(&pf_queue->lock);
-
-	return ret;
-}
-
-static bool pf_queue_full(struct pf_queue *pf_queue)
-{
-	lockdep_assert_held(&pf_queue->lock);
-
-	return CIRC_SPACE(pf_queue->head, pf_queue->tail,
-			  pf_queue->num_dw) <=
-		PF_MSG_LEN_DW;
-}
-
-int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len)
-{
-	struct xe_gt *gt = guc_to_gt(guc);
-	struct pf_queue *pf_queue;
-	unsigned long flags;
-	u32 asid;
-	bool full;
-
-	if (unlikely(len != PF_MSG_LEN_DW))
-		return -EPROTO;
-
-	asid = FIELD_GET(PFD_ASID, msg[1]);
-	pf_queue = gt->usm.pf_queue + (asid % NUM_PF_QUEUE);
-
-	/*
-	 * The below logic doesn't work unless PF_QUEUE_NUM_DW % PF_MSG_LEN_DW == 0
-	 */
-	xe_gt_assert(gt, !(pf_queue->num_dw % PF_MSG_LEN_DW));
-
-	spin_lock_irqsave(&pf_queue->lock, flags);
-	full = pf_queue_full(pf_queue);
-	if (!full) {
-		memcpy(pf_queue->data + pf_queue->head, msg, len * sizeof(u32));
-		pf_queue->head = (pf_queue->head + len) %
-			pf_queue->num_dw;
-		queue_work(gt->usm.pf_wq, &pf_queue->worker);
-	} else {
-		xe_gt_warn(gt, "PageFault Queue full, shouldn't be possible\n");
-	}
-	spin_unlock_irqrestore(&pf_queue->lock, flags);
-
-	return full ? -ENOSPC : 0;
-}
-
-#define USM_QUEUE_MAX_RUNTIME_MS	20
-
-static void pf_queue_work_func(struct work_struct *w)
-{
-	struct pf_queue *pf_queue = container_of(w, struct pf_queue, worker);
-	struct xe_gt *gt = pf_queue->gt;
-	struct xe_guc_pagefault_reply reply = {};
-	struct pagefault pf = {};
-	unsigned long threshold;
-	int ret;
-
-	threshold = jiffies + msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
-
-	while (get_pagefault(pf_queue, &pf)) {
-		ret = handle_pagefault(gt, &pf);
-		if (unlikely(ret)) {
-			print_pagefault(gt, &pf);
-			pf.fault_unsuccessful = 1;
-			xe_gt_dbg(gt, "Fault response: Unsuccessful %pe\n", ERR_PTR(ret));
-		}
-
-		reply.dw0 = FIELD_PREP(PFR_VALID, 1) |
-			FIELD_PREP(PFR_SUCCESS, pf.fault_unsuccessful) |
-			FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
-			FIELD_PREP(PFR_DESC_TYPE, FAULT_RESPONSE_DESC) |
-			FIELD_PREP(PFR_ASID, pf.asid);
-
-		reply.dw1 = FIELD_PREP(PFR_VFID, pf.vfid) |
-			FIELD_PREP(PFR_ENG_INSTANCE, pf.engine_instance) |
-			FIELD_PREP(PFR_ENG_CLASS, pf.engine_class) |
-			FIELD_PREP(PFR_PDATA, pf.pdata);
-
-		send_pagefault_reply(&gt->uc.guc, &reply);
-
-		if (time_after(jiffies, threshold) &&
-		    pf_queue->tail != pf_queue->head) {
-			queue_work(gt->usm.pf_wq, w);
-			break;
-		}
-	}
-}
-
-static void acc_queue_work_func(struct work_struct *w);
-
-static void pagefault_fini(void *arg)
-{
-	struct xe_gt *gt = arg;
-	struct xe_device *xe = gt_to_xe(gt);
-
-	if (!xe->info.has_usm)
-		return;
-
-	destroy_workqueue(gt->usm.acc_wq);
-	destroy_workqueue(gt->usm.pf_wq);
-}
-
-static int xe_alloc_pf_queue(struct xe_gt *gt, struct pf_queue *pf_queue)
-{
-	struct xe_device *xe = gt_to_xe(gt);
-	xe_dss_mask_t all_dss;
-	int num_dss, num_eus;
-
-	bitmap_or(all_dss, gt->fuse_topo.g_dss_mask, gt->fuse_topo.c_dss_mask,
-		  XE_MAX_DSS_FUSE_BITS);
-
-	num_dss = bitmap_weight(all_dss, XE_MAX_DSS_FUSE_BITS);
-	num_eus = bitmap_weight(gt->fuse_topo.eu_mask_per_dss,
-				XE_MAX_EU_FUSE_BITS) * num_dss;
-
-	/*
-	 * user can issue separate page faults per EU and per CS
-	 *
-	 * XXX: Multiplier required as compute UMD are getting PF queue errors
-	 * without it. Follow on why this multiplier is required.
-	 */
-#define PF_MULTIPLIER	8
-	pf_queue->num_dw =
-		(num_eus + XE_NUM_HW_ENGINES) * PF_MSG_LEN_DW * PF_MULTIPLIER;
-	pf_queue->num_dw = roundup_pow_of_two(pf_queue->num_dw);
-#undef PF_MULTIPLIER
-
-	pf_queue->gt = gt;
-	pf_queue->data = devm_kcalloc(xe->drm.dev, pf_queue->num_dw,
-				      sizeof(u32), GFP_KERNEL);
-	if (!pf_queue->data)
-		return -ENOMEM;
-
-	spin_lock_init(&pf_queue->lock);
-	INIT_WORK(&pf_queue->worker, pf_queue_work_func);
-
-	return 0;
-}
-
-int xe_gt_pagefault_init(struct xe_gt *gt)
-{
-	struct xe_device *xe = gt_to_xe(gt);
-	int i, ret = 0;
-
-	if (!xe->info.has_usm)
-		return 0;
-
-	for (i = 0; i < NUM_PF_QUEUE; ++i) {
-		ret = xe_alloc_pf_queue(gt, &gt->usm.pf_queue[i]);
-		if (ret)
-			return ret;
-	}
-	for (i = 0; i < NUM_ACC_QUEUE; ++i) {
-		gt->usm.acc_queue[i].gt = gt;
-		spin_lock_init(&gt->usm.acc_queue[i].lock);
-		INIT_WORK(&gt->usm.acc_queue[i].worker, acc_queue_work_func);
-	}
-
-	gt->usm.pf_wq = alloc_workqueue("xe_gt_page_fault_work_queue",
-					WQ_UNBOUND | WQ_HIGHPRI, NUM_PF_QUEUE);
-	if (!gt->usm.pf_wq)
-		return -ENOMEM;
-
-	gt->usm.acc_wq = alloc_workqueue("xe_gt_access_counter_work_queue",
-					 WQ_UNBOUND | WQ_HIGHPRI,
-					 NUM_ACC_QUEUE);
-	if (!gt->usm.acc_wq) {
-		destroy_workqueue(gt->usm.pf_wq);
-		return -ENOMEM;
-	}
-
-	return devm_add_action_or_reset(xe->drm.dev, pagefault_fini, gt);
-}
-
-void xe_gt_pagefault_reset(struct xe_gt *gt)
-{
-	struct xe_device *xe = gt_to_xe(gt);
-	int i;
-
-	if (!xe->info.has_usm)
-		return;
-
-	for (i = 0; i < NUM_PF_QUEUE; ++i) {
-		spin_lock_irq(&gt->usm.pf_queue[i].lock);
-		gt->usm.pf_queue[i].head = 0;
-		gt->usm.pf_queue[i].tail = 0;
-		spin_unlock_irq(&gt->usm.pf_queue[i].lock);
-	}
-
-	for (i = 0; i < NUM_ACC_QUEUE; ++i) {
-		spin_lock(&gt->usm.acc_queue[i].lock);
-		gt->usm.acc_queue[i].head = 0;
-		gt->usm.acc_queue[i].tail = 0;
-		spin_unlock(&gt->usm.acc_queue[i].lock);
-	}
-}
-
-static int granularity_in_byte(int val)
-{
-	switch (val) {
-	case 0:
-		return SZ_128K;
-	case 1:
-		return SZ_2M;
-	case 2:
-		return SZ_16M;
-	case 3:
-		return SZ_64M;
-	default:
-		return 0;
-	}
-}
-
-static int sub_granularity_in_byte(int val)
-{
-	return (granularity_in_byte(val) / 32);
-}
-
-static void print_acc(struct xe_gt *gt, struct acc *acc)
-{
-	xe_gt_warn(gt, "Access counter request:\n"
-		   "\tType: %s\n"
-		   "\tASID: %d\n"
-		   "\tVFID: %d\n"
-		   "\tEngine: %d:%d\n"
-		   "\tGranularity: 0x%x KB Region/ %d KB sub-granularity\n"
-		   "\tSub_Granularity Vector: 0x%08x\n"
-		   "\tVA Range base: 0x%016llx\n",
-		   acc->access_type ? "AC_NTFY_VAL" : "AC_TRIG_VAL",
-		   acc->asid, acc->vfid, acc->engine_class, acc->engine_instance,
-		   granularity_in_byte(acc->granularity) / SZ_1K,
-		   sub_granularity_in_byte(acc->granularity) / SZ_1K,
-		   acc->sub_granularity, acc->va_range_base);
-}
-
-static struct xe_vma *get_acc_vma(struct xe_vm *vm, struct acc *acc)
-{
-	u64 page_va = acc->va_range_base + (ffs(acc->sub_granularity) - 1) *
-		sub_granularity_in_byte(acc->granularity);
-
-	return xe_vm_find_overlapping_vma(vm, page_va, SZ_4K);
-}
-
-static int handle_acc(struct xe_gt *gt, struct acc *acc)
-{
-	struct xe_device *xe = gt_to_xe(gt);
-	struct xe_tile *tile = gt_to_tile(gt);
-	struct drm_exec exec;
-	struct xe_vm *vm;
-	struct xe_vma *vma;
-	int ret = 0;
-
-	/* We only support ACC_TRIGGER at the moment */
-	if (acc->access_type != ACC_TRIGGER)
-		return -EINVAL;
-
-	vm = asid_to_vm(xe, acc->asid);
-	if (IS_ERR(vm))
-		return PTR_ERR(vm);
-
-	down_read(&vm->lock);
-
-	/* Lookup VMA */
-	vma = get_acc_vma(vm, acc);
-	if (!vma) {
-		ret = -EINVAL;
-		goto unlock_vm;
-	}
-
-	trace_xe_vma_acc(vma);
-
-	/* Userptr or null can't be migrated, nothing to do */
-	if (xe_vma_has_no_bo(vma))
-		goto unlock_vm;
-
-	/* Lock VM and BOs dma-resv */
-	drm_exec_init(&exec, 0, 0);
-	drm_exec_until_all_locked(&exec) {
-		ret = xe_pf_begin(&exec, vma, true, tile->mem.vram);
-		drm_exec_retry_on_contention(&exec);
-		if (ret)
-			break;
-	}
-
-	drm_exec_fini(&exec);
-unlock_vm:
-	up_read(&vm->lock);
-	xe_vm_put(vm);
-
-	return ret;
-}
-
-#define make_u64(hi__, low__)  ((u64)(hi__) << 32 | (u64)(low__))
-
-#define ACC_MSG_LEN_DW        4
-
-static bool get_acc(struct acc_queue *acc_queue, struct acc *acc)
-{
-	const struct xe_guc_acc_desc *desc;
-	bool ret = false;
-
-	spin_lock(&acc_queue->lock);
-	if (acc_queue->tail != acc_queue->head) {
-		desc = (const struct xe_guc_acc_desc *)
-			(acc_queue->data + acc_queue->tail);
-
-		acc->granularity = FIELD_GET(ACC_GRANULARITY, desc->dw2);
-		acc->sub_granularity = FIELD_GET(ACC_SUBG_HI, desc->dw1) << 31 |
-			FIELD_GET(ACC_SUBG_LO, desc->dw0);
-		acc->engine_class = FIELD_GET(ACC_ENG_CLASS, desc->dw1);
-		acc->engine_instance = FIELD_GET(ACC_ENG_INSTANCE, desc->dw1);
-		acc->asid =  FIELD_GET(ACC_ASID, desc->dw1);
-		acc->vfid =  FIELD_GET(ACC_VFID, desc->dw2);
-		acc->access_type = FIELD_GET(ACC_TYPE, desc->dw0);
-		acc->va_range_base = make_u64(desc->dw3 & ACC_VIRTUAL_ADDR_RANGE_HI,
-					      desc->dw2 & ACC_VIRTUAL_ADDR_RANGE_LO);
-
-		acc_queue->tail = (acc_queue->tail + ACC_MSG_LEN_DW) %
-				  ACC_QUEUE_NUM_DW;
-		ret = true;
-	}
-	spin_unlock(&acc_queue->lock);
-
-	return ret;
-}
-
-static void acc_queue_work_func(struct work_struct *w)
-{
-	struct acc_queue *acc_queue = container_of(w, struct acc_queue, worker);
-	struct xe_gt *gt = acc_queue->gt;
-	struct acc acc = {};
-	unsigned long threshold;
-	int ret;
-
-	threshold = jiffies + msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
-
-	while (get_acc(acc_queue, &acc)) {
-		ret = handle_acc(gt, &acc);
-		if (unlikely(ret)) {
-			print_acc(gt, &acc);
-			xe_gt_warn(gt, "ACC: Unsuccessful %pe\n", ERR_PTR(ret));
-		}
-
-		if (time_after(jiffies, threshold) &&
-		    acc_queue->tail != acc_queue->head) {
-			queue_work(gt->usm.acc_wq, w);
-			break;
-		}
-	}
-}
-
-static bool acc_queue_full(struct acc_queue *acc_queue)
-{
-	lockdep_assert_held(&acc_queue->lock);
-
-	return CIRC_SPACE(acc_queue->head, acc_queue->tail, ACC_QUEUE_NUM_DW) <=
-		ACC_MSG_LEN_DW;
-}
-
-int xe_guc_access_counter_notify_handler(struct xe_guc *guc, u32 *msg, u32 len)
-{
-	struct xe_gt *gt = guc_to_gt(guc);
-	struct acc_queue *acc_queue;
-	u32 asid;
-	bool full;
-
-	/*
-	 * The below logic doesn't work unless ACC_QUEUE_NUM_DW % ACC_MSG_LEN_DW == 0
-	 */
-	BUILD_BUG_ON(ACC_QUEUE_NUM_DW % ACC_MSG_LEN_DW);
-
-	if (unlikely(len != ACC_MSG_LEN_DW))
-		return -EPROTO;
-
-	asid = FIELD_GET(ACC_ASID, msg[1]);
-	acc_queue = &gt->usm.acc_queue[asid % NUM_ACC_QUEUE];
-
-	spin_lock(&acc_queue->lock);
-	full = acc_queue_full(acc_queue);
-	if (!full) {
-		memcpy(acc_queue->data + acc_queue->head, msg,
-		       len * sizeof(u32));
-		acc_queue->head = (acc_queue->head + len) % ACC_QUEUE_NUM_DW;
-		queue_work(gt->usm.acc_wq, &acc_queue->worker);
-	} else {
-		xe_gt_warn(gt, "ACC Queue full, dropping ACC\n");
-	}
-	spin_unlock(&acc_queue->lock);
-
-	return full ? -ENOSPC : 0;
-}
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.h b/drivers/gpu/drm/xe/xe_gt_pagefault.h
deleted file mode 100644
index 839c065a5e4c..000000000000
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.h
+++ /dev/null
@@ -1,19 +0,0 @@
-/* SPDX-License-Identifier: MIT */
-/*
- * Copyright © 2022 Intel Corporation
- */
-
-#ifndef _XE_GT_PAGEFAULT_H_
-#define _XE_GT_PAGEFAULT_H_
-
-#include <linux/types.h>
-
-struct xe_gt;
-struct xe_guc;
-
-int xe_gt_pagefault_init(struct xe_gt *gt);
-void xe_gt_pagefault_reset(struct xe_gt *gt);
-int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len);
-int xe_guc_access_counter_notify_handler(struct xe_guc *guc, u32 *msg, u32 len);
-
-#endif	/* _XE_GT_PAGEFAULT_ */
diff --git a/drivers/gpu/drm/xe/xe_gt_types.h b/drivers/gpu/drm/xe/xe_gt_types.h
index dfd4a16da5f0..48ff0c491ccc 100644
--- a/drivers/gpu/drm/xe/xe_gt_types.h
+++ b/drivers/gpu/drm/xe/xe_gt_types.h
@@ -239,71 +239,6 @@ struct xe_gt {
 		 * operations (e.g. mmigrations, fixing page tables)
 		 */
 		u16 reserved_bcs_instance;
-		/** @usm.pf_wq: page fault work queue, unbound, high priority */
-		struct workqueue_struct *pf_wq;
-		/** @usm.acc_wq: access counter work queue, unbound, high priority */
-		struct workqueue_struct *acc_wq;
-		/**
-		 * @usm.pf_queue: Page fault queue used to sync faults so faults can
-		 * be processed not under the GuC CT lock. The queue is sized so
-		 * it can sync all possible faults (1 per physical engine).
-		 * Multiple queues exists for page faults from different VMs are
-		 * be processed in parallel.
-		 */
-		struct pf_queue {
-			/** @usm.pf_queue.gt: back pointer to GT */
-			struct xe_gt *gt;
-			/** @usm.pf_queue.data: data in the page fault queue */
-			u32 *data;
-			/**
-			 * @usm.pf_queue.num_dw: number of DWORDS in the page
-			 * fault queue. Dynamically calculated based on the number
-			 * of compute resources available.
-			 */
-			u32 num_dw;
-			/**
-			 * @usm.pf_queue.tail: tail pointer in DWs for page fault queue,
-			 * moved by worker which processes faults (consumer).
-			 */
-			u16 tail;
-			/**
-			 * @usm.pf_queue.head: head pointer in DWs for page fault queue,
-			 * moved by G2H handler (producer).
-			 */
-			u16 head;
-			/** @usm.pf_queue.lock: protects page fault queue */
-			spinlock_t lock;
-			/** @usm.pf_queue.worker: to process page faults */
-			struct work_struct worker;
-#define NUM_PF_QUEUE	4
-		} pf_queue[NUM_PF_QUEUE];
-		/**
-		 * @usm.acc_queue: Same as page fault queue, cannot process access
-		 * counters under CT lock.
-		 */
-		struct acc_queue {
-			/** @usm.acc_queue.gt: back pointer to GT */
-			struct xe_gt *gt;
-#define ACC_QUEUE_NUM_DW	128
-			/** @usm.acc_queue.data: data in the page fault queue */
-			u32 data[ACC_QUEUE_NUM_DW];
-			/**
-			 * @usm.acc_queue.tail: tail pointer in DWs for access counter queue,
-			 * moved by worker which processes counters
-			 * (consumer).
-			 */
-			u16 tail;
-			/**
-			 * @usm.acc_queue.head: head pointer in DWs for access counter queue,
-			 * moved by G2H handler (producer).
-			 */
-			u16 head;
-			/** @usm.acc_queue.lock: protects page fault queue */
-			spinlock_t lock;
-			/** @usm.acc_queue.worker: to process access counters */
-			struct work_struct worker;
-#define NUM_ACC_QUEUE	4
-		} acc_queue[NUM_ACC_QUEUE];
 	} usm;
 
 	/** @ordered_wq: used to serialize GT resets and TDRs */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 08/11] drm/xe: Fine grained page fault locking
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
                   ` (6 preceding siblings ...)
  2025-08-06  6:22 ` [PATCH 07/11] drm/xe: Remove unused GT page fault code Matthew Brost
@ 2025-08-06  6:22 ` Matthew Brost
  2025-08-06  6:22 ` [PATCH 09/11] drm/xe: Allow prefetch-only VM bind IOCTLs to use VM read lock Matthew Brost
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

Enable page faults to be serviced while holding vm->lock in read mode.

Introduce additional locks to:
 - Ensure only one page fault thread services a given range or VMA
 - Serialize SVM garbage collection
 - Protect SVM range insertion and removal

While these locks may contend during page faults, expensive operations
like migration can now run in parallel within a single VM.

In addition to new locking, ranges must be reference-counted after
lookup, as another thread could immediately remove them from the GPU SVM
tree, potentially dropping the last reference.

Lastly, decouple the VM’s ASID from the page fault queue selection to
allow parallel page fault handling within the same VM.

Lays the groundwork for prefetch IOCTLs to use threaded migration too.

Signed-off-by: Matthew Brost <atthew.brost@intel.com>
---
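For reference, a minimal sketch of the fault-path locking/refcount pattern
described above. The helpers xe_lookup_range_sketch() and
xe_service_range_sketch() are placeholders for illustration only (the real
lookup and servicing live in xe_svm.c / xe_pagefault.c); the point is the
ordering: vm->lock (read) -> range->lock -> actual servicing, with the range
reference held across the critical section.

static int xe_fault_path_sketch(struct xe_vm *vm, u64 fault_addr)
{
	struct xe_svm_range *range;
	int err = 0;

	/* Fault workers now only need vm->lock in read mode. */
	down_read(&vm->lock);

	/*
	 * The lookup hands back a referenced range so a concurrent garbage
	 * collection cannot drop the last reference under us.
	 */
	range = xe_lookup_range_sketch(vm, fault_addr);	/* placeholder */
	if (IS_ERR(range)) {
		err = PTR_ERR(range);
		goto out_unlock;
	}

	/* Per-range lock: one fault thread services a given range at a time. */
	mutex_lock(&range->lock);
	if (!range->removed)
		err = xe_service_range_sketch(vm, range);	/* placeholder */
	mutex_unlock(&range->lock);

	xe_svm_range_put(range);
out_unlock:
	up_read(&vm->lock);
	return err;
}
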
 drivers/gpu/drm/xe/xe_device_types.h |  2 +
 drivers/gpu/drm/xe/xe_hmm.c          |  4 +-
 drivers/gpu/drm/xe/xe_pagefault.c    | 33 +++++++---
 drivers/gpu/drm/xe/xe_svm.c          | 96 +++++++++++++++++++---------
 drivers/gpu/drm/xe/xe_svm.h          | 38 +++++++++++
 drivers/gpu/drm/xe/xe_vm.c           | 88 +++++++++++++++++++++----
 drivers/gpu/drm/xe/xe_vm.h           |  2 +
 drivers/gpu/drm/xe/xe_vm_types.h     | 24 ++++++-
 8 files changed, 230 insertions(+), 57 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 6aa119026ce9..02b91a698500 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -393,6 +393,8 @@ struct xe_device {
 		struct xarray asid_to_vm;
 		/** @usm.next_asid: next ASID, used to cyclical alloc asids */
 		u32 next_asid;
+		/** @usm.current_pf_queue: current page fault queue */
+		u32 current_pf_queue;
 		/** @usm.lock: protects UM state */
 		struct rw_semaphore lock;
 		/** @usm.pf_wq: page fault work queue, unbound, high priority */
diff --git a/drivers/gpu/drm/xe/xe_hmm.c b/drivers/gpu/drm/xe/xe_hmm.c
index 57b71956ddf4..c9d53ffae843 100644
--- a/drivers/gpu/drm/xe/xe_hmm.c
+++ b/drivers/gpu/drm/xe/xe_hmm.c
@@ -136,7 +136,7 @@ static void xe_hmm_userptr_set_mapped(struct xe_userptr_vma *uvma)
 	struct xe_userptr *userptr = &uvma->userptr;
 	struct xe_vm *vm = xe_vma_vm(&uvma->vma);
 
-	lockdep_assert_held_write(&vm->lock);
+	lockdep_assert(xe_vma_userptr_lockdep(uvma));
 	lockdep_assert_held(&vm->userptr.notifier_lock);
 
 	mutex_lock(&userptr->unmap_mutex);
@@ -154,7 +154,7 @@ void xe_hmm_userptr_unmap(struct xe_userptr_vma *uvma)
 	struct xe_device *xe = vm->xe;
 
 	if (!lockdep_is_held_type(&vm->userptr.notifier_lock, 0) &&
-	    !lockdep_is_held_type(&vm->lock, 0) &&
+	    !xe_vma_userptr_lockdep(uvma) &&
 	    !(vma->gpuva.flags & XE_VMA_DESTROYED)) {
 		/* Don't unmap in exec critical section. */
 		xe_vm_assert_held(vm);
diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
index 474412c21ec3..95d2eb8566fb 100644
--- a/drivers/gpu/drm/xe/xe_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_pagefault.c
@@ -79,7 +79,7 @@ static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
 	ktime_t end = 0;
 	int err;
 
-	lockdep_assert_held_write(&vm->lock);
+	lockdep_assert_held(&vm->lock);
 
 	xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_COUNT, 1);
 	xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_KB,
@@ -87,6 +87,8 @@ static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
 
 	trace_xe_vma_pagefault(vma);
 
+	guard(mutex)(&vma->fault_lock);
+
 	/* Check if VMA is valid, opportunistic check only */
 	if (xe_vm_has_valid_gpu_mapping(tile, vma->tile_present,
 					vma->tile_invalidated) && !atomic)
@@ -122,10 +124,13 @@ static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
 			goto unlock_dma_resv;
 		}
 	}
+	drm_exec_fini(&exec);
 
 	dma_fence_wait(fence, false);
 	dma_fence_put(fence);
 
+	return err;
+
 unlock_dma_resv:
 	drm_exec_fini(&exec);
 	if (err == -EAGAIN)
@@ -172,10 +177,7 @@ static int xe_pagefault_service(struct xe_pagefault *pf)
 	if (IS_ERR(vm))
 		return PTR_ERR(vm);
 
-	/*
-	 * TODO: Change to read lock? Using write lock for simplicity.
-	 */
-	down_write(&vm->lock);
+	down_read(&vm->lock);
 
 	if (xe_vm_is_closed(vm)) {
 		err = -ENOENT;
@@ -199,7 +201,7 @@ static int xe_pagefault_service(struct xe_pagefault *pf)
 unlock_vm:
 	if (!err)
 		vm->usm.last_fault_vma = vma;
-	up_write(&vm->lock);
+	up_read(&vm->lock);
 	xe_vm_put(vm);
 
 	return err;
@@ -404,6 +406,19 @@ static bool xe_pagefault_queue_full(struct xe_pagefault_queue *pf_queue)
 		xe_pagefault_entry_size();
 }
 
+/*
+ * This function can race with multiple page fault producers, but worst case we
+ * stick a page fault on the same queue for consumption.
+ */
+static int xe_pagefault_queue_index(struct xe_device *xe)
+{
+	u32 old_pf_queue = READ_ONCE(xe->usm.current_pf_queue);
+
+	WRITE_ONCE(xe->usm.current_pf_queue, (old_pf_queue + 1));
+
+	return old_pf_queue % XE_PAGEFAULT_QUEUE_COUNT;
+}
+
 /**
  * xe_pagefault_handler() - Page fault handler
  * @xe: xe device instance
@@ -416,8 +431,8 @@ static bool xe_pagefault_queue_full(struct xe_pagefault_queue *pf_queue)
  */
 int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
 {
-	struct xe_pagefault_queue *pf_queue = xe->usm.pf_queue +
-		(pf->consumer.asid % XE_PAGEFAULT_QUEUE_COUNT);
+	int queue_index = xe_pagefault_queue_index(xe);
+	struct xe_pagefault_queue *pf_queue = xe->usm.pf_queue + queue_index;
 	unsigned long flags;
 	bool full;
 
@@ -431,7 +446,7 @@ int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
 	} else {
 		drm_warn(&xe->drm,
 			 "PageFault Queue (%d) full, shouldn't be possible\n",
-			 pf->consumer.asid % XE_PAGEFAULT_QUEUE_COUNT);
+			 queue_index);
 	}
 	spin_unlock_irqrestore(&pf_queue->lock, flags);
 
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 1bcf3ba3b350..6e5d9ce7c76e 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -82,6 +82,7 @@ xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
 		return NULL;
 
 	INIT_LIST_HEAD(&range->garbage_collector_link);
+	mutex_init(&range->lock);
 	xe_vm_get(gpusvm_to_vm(gpusvm));
 
 	return &range->base;
@@ -89,6 +90,7 @@ xe_svm_range_alloc(struct drm_gpusvm *gpusvm)
 
 static void xe_svm_range_free(struct drm_gpusvm_range *range)
 {
+	mutex_destroy(&to_xe_range(range)->lock);
 	xe_vm_put(range_to_vm(range));
 	kfree(range);
 }
@@ -103,11 +105,11 @@ xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct xe_svm_range *range,
 
 	drm_gpusvm_range_set_unmapped(&range->base, mmu_range);
 
-	spin_lock(&vm->svm.garbage_collector.lock);
+	spin_lock(&vm->svm.garbage_collector.list_lock);
 	if (list_empty(&range->garbage_collector_link))
 		list_add_tail(&range->garbage_collector_link,
 			      &vm->svm.garbage_collector.range_list);
-	spin_unlock(&vm->svm.garbage_collector.lock);
+	spin_unlock(&vm->svm.garbage_collector.list_lock);
 
 	queue_work(xe->usm.pf_wq, &vm->svm.garbage_collector.work);
 }
@@ -238,16 +240,24 @@ static int __xe_svm_garbage_collector(struct xe_vm *vm,
 {
 	struct dma_fence *fence;
 
-	range_debug(range, "GARBAGE COLLECTOR");
+	scoped_guard(mutex, &range->lock) {
+		drm_gpusvm_range_get(&range->base);
+		range->removed = true;
 
-	xe_vm_lock(vm, false);
-	fence = xe_vm_range_unbind(vm, range);
-	xe_vm_unlock(vm);
-	if (IS_ERR(fence))
-		return PTR_ERR(fence);
-	dma_fence_put(fence);
+		range_debug(range, "GARBAGE COLLECTOR");
+
+		xe_vm_lock(vm, false);
+		fence = xe_vm_range_unbind(vm, range);
+		xe_vm_unlock(vm);
+		if (IS_ERR(fence))
+			return PTR_ERR(fence);
+		dma_fence_put(fence);
+
+		scoped_guard(mutex, &vm->svm.range_lock)
+			drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
+	}
 
-	drm_gpusvm_range_remove(&vm->svm.gpusvm, &range->base);
+	drm_gpusvm_range_put(&range->base);
 
 	return 0;
 }
@@ -257,12 +267,14 @@ static int xe_svm_garbage_collector(struct xe_vm *vm)
 	struct xe_svm_range *range;
 	int err;
 
-	lockdep_assert_held_write(&vm->lock);
+	lockdep_assert_held(&vm->lock);
 
 	if (xe_vm_is_closed_or_banned(vm))
 		return -ENOENT;
 
-	spin_lock(&vm->svm.garbage_collector.lock);
+	guard(mutex)(&vm->svm.garbage_collector.lock);
+
+	spin_lock(&vm->svm.garbage_collector.list_lock);
 	for (;;) {
 		range = list_first_entry_or_null(&vm->svm.garbage_collector.range_list,
 						 typeof(*range),
@@ -271,7 +283,7 @@ static int xe_svm_garbage_collector(struct xe_vm *vm)
 			break;
 
 		list_del(&range->garbage_collector_link);
-		spin_unlock(&vm->svm.garbage_collector.lock);
+		spin_unlock(&vm->svm.garbage_collector.list_lock);
 
 		err = __xe_svm_garbage_collector(vm, range);
 		if (err) {
@@ -282,9 +294,9 @@ static int xe_svm_garbage_collector(struct xe_vm *vm)
 			return err;
 		}
 
-		spin_lock(&vm->svm.garbage_collector.lock);
+		spin_lock(&vm->svm.garbage_collector.list_lock);
 	}
-	spin_unlock(&vm->svm.garbage_collector.lock);
+	spin_unlock(&vm->svm.garbage_collector.list_lock);
 
 	return 0;
 }
@@ -294,9 +306,8 @@ static void xe_svm_garbage_collector_work_func(struct work_struct *w)
 	struct xe_vm *vm = container_of(w, struct xe_vm,
 					svm.garbage_collector.work);
 
-	down_write(&vm->lock);
+	guard(rwsem_read)(&vm->lock);
 	xe_svm_garbage_collector(vm);
-	up_write(&vm->lock);
 }
 
 #if IS_ENABLED(CONFIG_DRM_XE_PAGEMAP)
@@ -560,7 +571,9 @@ int xe_svm_init(struct xe_vm *vm)
 {
 	int err;
 
-	spin_lock_init(&vm->svm.garbage_collector.lock);
+	mutex_init(&vm->svm.range_lock);
+	mutex_init(&vm->svm.garbage_collector.lock);
+	spin_lock_init(&vm->svm.garbage_collector.list_lock);
 	INIT_LIST_HEAD(&vm->svm.garbage_collector.range_list);
 	INIT_WORK(&vm->svm.garbage_collector.work,
 		  xe_svm_garbage_collector_work_func);
@@ -573,7 +586,7 @@ int xe_svm_init(struct xe_vm *vm)
 	if (err)
 		return err;
 
-	drm_gpusvm_driver_set_lock(&vm->svm.gpusvm, &vm->lock);
+	drm_gpusvm_driver_set_lock(&vm->svm.gpusvm, &vm->svm.range_lock);
 
 	return 0;
 }
@@ -600,7 +613,10 @@ void xe_svm_fini(struct xe_vm *vm)
 {
 	xe_assert(vm->xe, xe_vm_is_closed(vm));
 
-	drm_gpusvm_fini(&vm->svm.gpusvm);
+	scoped_guard(mutex, &vm->svm.range_lock)
+		drm_gpusvm_fini(&vm->svm.gpusvm);
+	mutex_destroy(&vm->svm.range_lock);
+	mutex_destroy(&vm->svm.garbage_collector.lock);
 }
 
 static bool xe_svm_range_is_valid(struct xe_svm_range *range,
@@ -804,19 +820,25 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 			IS_ENABLED(CONFIG_DRM_XE_PAGEMAP) ?
 			vm->xe->atomic_svm_timeslice_ms : 0,
 	};
-	struct xe_svm_range *range;
+	struct xe_svm_range *range = NULL;
 	struct dma_fence *fence;
 	struct xe_tile *tile = gt_to_tile(gt);
 	int migrate_try_count = ctx.devmem_only ? 3 : 1;
 	ktime_t end = 0;
-	int err;
+	int err = 0;
 
-	lockdep_assert_held_write(&vm->lock);
+	lockdep_assert_held(&vm->lock);
 	xe_assert(vm->xe, xe_vma_is_cpu_addr_mirror(vma));
 
 	xe_gt_stats_incr(gt, XE_GT_STATS_ID_SVM_PAGEFAULT_COUNT, 1);
 
 retry:
+	/* Release old range */
+	if (range) {
+		mutex_unlock(&range->lock);
+		drm_gpusvm_range_put(&range->base);
+	}
+
 	/* Always process UNMAPs first so view SVM ranges is current */
 	err = xe_svm_garbage_collector(vm);
 	if (err)
@@ -830,8 +852,13 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	if (ctx.devmem_only && !range->base.flags.migrate_devmem)
 		return -EACCES;
 
+	mutex_lock(&range->lock);
+
+	if (xe_svm_range_is_removed(range))
+		goto retry;
+
 	if (xe_svm_range_is_valid(range, tile, ctx.devmem_only))
-		return 0;
+		goto err_out;
 
 	range_debug(range, "PAGE FAULT");
 
@@ -849,7 +876,7 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 				drm_err(&vm->xe->drm,
 					"VRAM allocation failed, retry count exceeded, asid=%u, errno=%pe\n",
 					vm->usm.asid, ERR_PTR(err));
-				return err;
+				goto err_out;
 			}
 		}
 	}
@@ -899,6 +926,8 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
 	dma_fence_put(fence);
 
 err_out:
+	mutex_unlock(&range->lock);
+	drm_gpusvm_range_put(&range->base);
 
 	return err;
 }
@@ -933,27 +962,34 @@ int xe_svm_bo_evict(struct xe_bo *bo)
 }
 
 /**
- * xe_svm_range_find_or_insert- Find or insert GPU SVM range
+ * xe_svm_range_find_or_insert() - Find or insert GPU SVM range
  * @vm: xe_vm pointer
  * @addr: address for which range needs to be found/inserted
  * @vma:  Pointer to struct xe_vma which mirrors CPU
  * @ctx: GPU SVM context
  *
  * This function finds or inserts a newly allocated a SVM range based on the
- * address.
+ * address. Takes a reference to the SVM range on success.
  *
  * Return: Pointer to the SVM range on success, ERR_PTR() on failure.
  */
 struct xe_svm_range *xe_svm_range_find_or_insert(struct xe_vm *vm, u64 addr,
-						 struct xe_vma *vma, struct drm_gpusvm_ctx *ctx)
+						 struct xe_vma *vma,
+						 struct drm_gpusvm_ctx *ctx)
 {
 	struct drm_gpusvm_range *r;
 
-	r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm, max(addr, xe_vma_start(vma)),
-					    xe_vma_start(vma), xe_vma_end(vma), ctx);
+	guard(mutex)(&vm->svm.range_lock);
+
+	r = drm_gpusvm_range_find_or_insert(&vm->svm.gpusvm,
+					    max(addr, xe_vma_start(vma)),
+					    xe_vma_start(vma),
+					    xe_vma_end(vma), ctx);
 	if (IS_ERR(r))
 		return ERR_PTR(PTR_ERR(r));
 
+	drm_gpusvm_range_get(r);
+
 	return to_xe_range(r);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
index da9a69ea0bb1..939043d6fbf1 100644
--- a/drivers/gpu/drm/xe/xe_svm.h
+++ b/drivers/gpu/drm/xe/xe_svm.h
@@ -29,6 +29,13 @@ struct xe_svm_range {
 	 * list. Protected by VM's garbage collect lock.
 	 */
 	struct list_head garbage_collector_link;
+	/**
+	 * @lock: Protects fault handler, garbage collector, and prefetch
+	 * critical sections, ensuring only one thread operates on a range at a
+	 * time. Locking order: inside vm->lock and the garbage collector
+	 * lock, outside dma-resv locks and vm->svm.range_lock.
+	 */
+	struct mutex lock;
 	/**
 	 * @tile_present: Tile mask of binding is present for this range.
 	 * Protected by GPU SVM notifier lock.
@@ -39,8 +46,22 @@ struct xe_svm_range {
 	 * range. Protected by GPU SVM notifier lock.
 	 */
 	u8 tile_invalidated;
+	/**
+	 * @removed: Range has been removed from GPU SVM tree, protected by
+	 * @lock.
+	 */
+	bool removed;
 };
 
+/**
+ * xe_svm_range_put() - SVM range put
+ * @range: SVM range
+ */
+static inline void xe_svm_range_put(struct xe_svm_range *range)
+{
+	drm_gpusvm_range_put(&range->base);
+}
+
 /**
  * xe_svm_range_pages_valid() - SVM range pages valid
  * @range: SVM range
@@ -102,6 +123,19 @@ static inline bool xe_svm_range_has_dma_mapping(struct xe_svm_range *range)
 	return range->base.flags.has_dma_mapping;
 }
 
+/**
+ * xe_svm_range_is_removed() - SVM range is removed from GPU SVM tree
+ * @range: SVM range
+ *
+ * Return: True if SVM range is removed from GPU SVM tree, False otherwise
+ */
+static inline bool xe_svm_range_is_removed(struct xe_svm_range *range)
+{
+	lockdep_assert_held(&range->lock);
+
+	return range->removed;
+}
+
 /**
  * to_xe_range - Convert a drm_gpusvm_range pointer to a xe_svm_range
  * @r: Pointer to the drm_gpusvm_range structure
@@ -184,6 +218,10 @@ struct xe_svm_range {
 	u32 tile_invalidated;
 };
 
+static inline void xe_svm_range_put(struct xe_svm_range *range)
+{
+}
+
 static inline bool xe_svm_range_pages_valid(struct xe_svm_range *range)
 {
 	return false;
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index c9ae13c32117..2498cff58fe7 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -65,13 +65,41 @@ int xe_vma_userptr_check_repin(struct xe_userptr_vma *uvma)
 		-EAGAIN : 0;
 }
 
+#if IS_ENABLED(CONFIG_PROVE_LOCKING)
+/**
+ * xe_vma_userptr_lockdep() - Lockdep rules for modifying userptr pages
+ * @uvma: The userptr
+ *
+ * We need either:
+ * 1. vm->lock in write mode
+ * 2. vm->lock in read mode and vma->fault_lock
+ *
+ * Return: True if lockdep rules are met, False otherwise.
+ */
+bool xe_vma_userptr_lockdep(struct xe_userptr_vma *uvma)
+{
+	struct xe_vma *vma = &uvma->vma;
+	struct xe_vm *vm = xe_vma_vm(vma);
+
+	return lockdep_is_held_type(&vm->lock, 0) ||
+		(lockdep_is_held_type(&vm->lock, 1) &&
+		 lockdep_is_held_type(&vma->fault_lock, 0));
+}
+#else
+bool xe_vma_userptr_lockdep(struct xe_userptr_vma *uvma)
+{
+	return true;
+}
+#endif
+
 int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma)
 {
 	struct xe_vma *vma = &uvma->vma;
 	struct xe_vm *vm = xe_vma_vm(vma);
 	struct xe_device *xe = vm->xe;
 
-	lockdep_assert_held(&vm->lock);
+	lockdep_assert(xe_vma_userptr_lockdep(uvma));
+
 	xe_assert(xe, xe_vma_is_userptr(vma));
 
 	return xe_hmm_userptr_populate_range(uvma, false);
@@ -799,6 +827,17 @@ static int xe_vma_ops_alloc(struct xe_vma_ops *vops, bool array_of_binds)
 }
 ALLOW_ERROR_INJECTION(xe_vma_ops_alloc, ERRNO);
 
+static void xe_vma_svm_prefetch_ranges_fini(struct xe_vma_op *op)
+{
+	struct xe_svm_range *svm_range;
+	unsigned long i;
+
+	xa_for_each(&op->prefetch_range.range, i, svm_range)
+		xe_svm_range_put(svm_range);
+
+	xa_destroy(&op->prefetch_range.range);
+}
+
 static void xe_vma_svm_prefetch_op_fini(struct xe_vma_op *op)
 {
 	struct xe_vma *vma;
@@ -806,7 +845,7 @@ static void xe_vma_svm_prefetch_op_fini(struct xe_vma_op *op)
 	vma = gpuva_to_vma(op->base.prefetch.va);
 
 	if (op->base.op == DRM_GPUVA_OP_PREFETCH && xe_vma_is_cpu_addr_mirror(vma))
-		xa_destroy(&op->prefetch_range.range);
+		xe_vma_svm_prefetch_ranges_fini(op);
 }
 
 static void xe_vma_svm_prefetch_ops_fini(struct xe_vma_ops *vops)
@@ -1205,6 +1244,8 @@ static struct xe_vma *xe_vma_create(struct xe_vm *vm,
 			vma->gpuva.gem.obj = &bo->ttm.base;
 	}
 
+	mutex_init(&vma->fault_lock);
+
 	INIT_LIST_HEAD(&vma->combined_links.rebind);
 
 	INIT_LIST_HEAD(&vma->gpuva.gem.entry);
@@ -1299,6 +1340,7 @@ static void xe_vma_destroy_late(struct xe_vma *vma)
 		xe_bo_put(xe_vma_bo(vma));
 	}
 
+	mutex_destroy(&vma->fault_lock);
 	xe_vma_free(vma);
 }
 
@@ -1319,11 +1361,18 @@ static void vma_destroy_cb(struct dma_fence *fence,
 	queue_work(system_unbound_wq, &vma->destroy_work);
 }
 
+static void xe_vm_assert_write_mode_or_garbage_collector(struct xe_vm *vm)
+{
+	lockdep_assert(lockdep_is_held_type(&vm->lock, 0) ||
+		       (lockdep_is_held_type(&vm->lock, 1) &&
+			lockdep_is_held_type(&vm->svm.garbage_collector.lock, 0)));
+}
+
 static void xe_vma_destroy(struct xe_vma *vma, struct dma_fence *fence)
 {
 	struct xe_vm *vm = xe_vma_vm(vma);
 
-	lockdep_assert_held_write(&vm->lock);
+	xe_vm_assert_write_mode_or_garbage_collector(vm);
 	xe_assert(vm->xe, list_empty(&vma->combined_links.destroy));
 
 	if (xe_vma_is_userptr(vma)) {
@@ -2384,14 +2433,17 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_vma_ops *vops,
 			for_each_tile(tile, vm->xe, id)
 				tile_mask |= 0x1 << id;
 
-			xa_init_flags(&op->prefetch_range.range, XA_FLAGS_ALLOC);
+			xa_init_flags(&op->prefetch_range.range,
+				      XA_FLAGS_ALLOC);
 			op->prefetch_range.region = prefetch_region;
 			op->prefetch_range.ranges_count = 0;
 alloc_next_range:
-			svm_range = xe_svm_range_find_or_insert(vm, addr, vma, &ctx);
+			svm_range = xe_svm_range_find_or_insert(vm, addr, vma,
+								&ctx);
 
 			if (PTR_ERR(svm_range) == -ENOENT) {
-				u64 ret = xe_svm_find_vma_start(vm, addr, range_end, vma);
+				u64 ret = xe_svm_find_vma_start(vm, addr,
+								range_end, vma);
 
 				addr = ret == ULONG_MAX ? 0 : ret;
 				if (addr)
@@ -2405,21 +2457,24 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_vma_ops *vops,
 				goto unwind_prefetch_ops;
 			}
 
-			if (xe_svm_range_validate(vm, svm_range, tile_mask, !!prefetch_region)) {
-				xe_svm_range_debug(svm_range, "PREFETCH - RANGE IS VALID");
+			if (xe_svm_range_validate(vm, svm_range, tile_mask,
+						  !!prefetch_region)) {
+				xe_svm_range_debug(svm_range,
+						   "PREFETCH - RANGE IS VALID");
+				xe_svm_range_put(svm_range);
 				goto check_next_range;
 			}
 
 			err = xa_alloc(&op->prefetch_range.range,
 				       &i, svm_range, xa_limit_32b,
 				       GFP_KERNEL);
-
 			if (err)
 				goto unwind_prefetch_ops;
 
 			op->prefetch_range.ranges_count++;
 			vops->flags |= XE_VMA_OPS_FLAG_HAS_SVM_PREFETCH;
-			xe_svm_range_debug(svm_range, "PREFETCH - RANGE CREATED");
+			xe_svm_range_debug(svm_range,
+					   "PREFETCH - RANGE CREATED");
 check_next_range:
 			if (range_end > xe_svm_range_end(svm_range) &&
 			    xe_svm_range_end(svm_range) < xe_vma_end(vma)) {
@@ -2449,7 +2504,7 @@ static struct xe_vma *new_vma(struct xe_vm *vm, struct drm_gpuva_op_map *op,
 	struct xe_vma *vma;
 	int err = 0;
 
-	lockdep_assert_held_write(&vm->lock);
+	xe_vm_assert_write_mode_or_garbage_collector(vm);
 
 	if (bo) {
 		drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT, 0);
@@ -2529,7 +2584,7 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 {
 	int err = 0;
 
-	lockdep_assert_held_write(&vm->lock);
+	xe_vm_assert_write_mode_or_garbage_collector(vm);
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
@@ -2597,7 +2652,7 @@ static int vm_bind_ioctl_ops_parse(struct xe_vm *vm, struct drm_gpuva_ops *ops,
 	u8 id, tile_mask = 0;
 	int err = 0;
 
-	lockdep_assert_held_write(&vm->lock);
+	xe_vm_assert_write_mode_or_garbage_collector(vm);
 
 	for_each_tile(tile, vm->xe, id)
 		tile_mask |= 0x1 << id;
@@ -2776,7 +2831,7 @@ static void xe_vma_op_unwind(struct xe_vm *vm, struct xe_vma_op *op,
 			     bool post_commit, bool prev_post_commit,
 			     bool next_post_commit)
 {
-	lockdep_assert_held_write(&vm->lock);
+	xe_vm_assert_write_mode_or_garbage_collector(vm);
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
@@ -2907,6 +2962,11 @@ static int prefetch_ranges(struct xe_vm *vm, struct xe_vma_op *op)
 
 	/* TODO: Threading the migration */
 	xa_for_each(&op->prefetch_range.range, i, svm_range) {
+		guard(mutex)(&svm_range->lock);
+
+		if (xe_svm_range_is_removed(svm_range))
+			return -ENODATA;
+
 		if (!region)
 			xe_svm_range_migrate_to_smem(vm, svm_range);
 
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 3475a118f666..4b097b9f981d 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -262,6 +262,8 @@ int xe_vma_userptr_pin_pages(struct xe_userptr_vma *uvma);
 
 int xe_vma_userptr_check_repin(struct xe_userptr_vma *uvma);
 
+bool xe_vma_userptr_lockdep(struct xe_userptr_vma *uvma);
+
 bool xe_vm_validate_should_retry(struct drm_exec *exec, int err, ktime_t *end);
 
 int xe_vm_lock_vma(struct drm_exec *exec, struct xe_vma *vma);
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index bed6088e1bb3..1aabdedbfa92 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -100,6 +100,12 @@ struct xe_vma {
 		struct work_struct destroy_work;
 	};
 
+	/**
+	 * @fault_lock: Synchronizes fault processing. Locking order: inside
+	 * vm->lock, outside dma-resv.
+	 */
+	struct mutex fault_lock;
+
 	/**
 	 * @tile_invalidated: Tile mask of binding are invalidated for this VMA.
 	 * protected by BO's resv and for userptrs, vm->userptr.notifier_lock in
@@ -157,13 +163,27 @@ struct xe_vm {
 	struct {
 		/** @svm.gpusvm: base GPUSVM used to track fault allocations */
 		struct drm_gpusvm gpusvm;
+		/**
+		 * @svm.range_lock: Protects insertion and removal of ranges
+		 * from GPU SVM tree.
+		 */
+		struct mutex range_lock;
 		/**
 		 * @svm.garbage_collector: Garbage collector which is used unmap
 		 * SVM range's GPU bindings and destroy the ranges.
 		 */
 		struct {
-			/** @svm.garbage_collector.lock: Protect's range list */
-			spinlock_t lock;
+			/**
+			 * @svm.garbage_collector.lock: Ensures only one thread
+			 * runs the garbage collector at a time. Locking order:
+			 * inside vm->lock, outside range->lock and dma-resv.
+			 */
+			struct mutex lock;
+			/**
+			 * @svm.garbage_collector.list_lock: Protects the range
+			 * list
+			 */
+			spinlock_t list_lock;
 			/**
 			 * @svm.garbage_collector.range_list: List of SVM ranges
 			 * in the garbage collector.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 09/11] drm/xe: Allow prefetch-only VM bind IOCTLs to use VM read lock
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
                   ` (7 preceding siblings ...)
  2025-08-06  6:22 ` [PATCH 08/11] drm/xe: Fine grained page fault locking Matthew Brost
@ 2025-08-06  6:22 ` Matthew Brost
  2025-08-06  6:22 ` [PATCH 10/11] drm/xe: Thread prefetch of SVM ranges Matthew Brost
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

Prefetch-only VM bind IOCTLs do not modify VMAs after pinning userptr
pages. Downgrade vm->lock to read mode once pinning is complete.

Lays the groundwork for prefetch IOCTLs to use threaded migration.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
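For reference, the rwsem downgrade pattern this relies on, in isolation.
create_and_pin_ops_sketch(), run_prefetch_ops_sketch() and
run_bind_ops_sketch() are placeholder names standing in for the real work in
xe_vm_bind_ioctl(); downgrade_write() atomically converts the held write lock
into a read lock, so page faults (which take vm->lock in read mode) can run
concurrently with a prefetch-only bind.

static int bind_lock_sketch(struct xe_vm *vm, struct xe_vma_ops *vops)
{
	int err;

	down_write(&vm->lock);

	/* Userptr pinning and op creation still run under the write lock. */
	err = create_and_pin_ops_sketch(vm, vops);	/* placeholder */
	if (err) {
		up_write(&vm->lock);
		return err;
	}

	if (!(vops->flags & XE_VMA_OPS_FLAG_MODIFIES_GPUVA)) {
		/* Prefetch-only: no GPUVA modifications, read mode suffices. */
		downgrade_write(&vm->lock);
		err = run_prefetch_ops_sketch(vm, vops);	/* placeholder */
		up_read(&vm->lock);
	} else {
		err = run_bind_ops_sketch(vm, vops);	/* placeholder */
		up_write(&vm->lock);
	}

	return err;
}
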
 drivers/gpu/drm/xe/xe_vm.c       | 41 ++++++++++++++++++++++++++++----
 drivers/gpu/drm/xe/xe_vm_types.h |  4 +++-
 2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 2498cff58fe7..3211827ef6d7 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -1763,6 +1763,12 @@ struct xe_vm *xe_vm_create(struct xe_device *xe, u32 flags)
 		err = xe_svm_init(vm);
 		if (err)
 			goto err_no_resv;
+	} else {
+		/*
+		 * Avoid lockdep explosions in
+		 * xe_vm_assert_write_mode_or_garbage_collector
+		 */
+		mutex_init(&vm->svm.garbage_collector.lock);
 	}
 
 	vm_resv_obj = drm_gpuvm_resv_object_alloc(&xe->drm);
@@ -1996,6 +2002,8 @@ void xe_vm_close_and_put(struct xe_vm *vm)
 
 	if (xe_vm_in_fault_mode(vm))
 		xe_svm_fini(vm);
+	else
+		mutex_destroy(&vm->svm.garbage_collector.lock);
 
 	up_write(&vm->lock);
 
@@ -2365,10 +2373,12 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_vma_ops *vops,
 	switch (operation) {
 	case DRM_XE_VM_BIND_OP_MAP:
 	case DRM_XE_VM_BIND_OP_MAP_USERPTR:
+		vops->flags |= XE_VMA_OPS_FLAG_MODIFIES_GPUVA;
 		ops = drm_gpuvm_sm_map_ops_create(&vm->gpuvm, addr, range,
 						  obj, bo_offset_or_userptr);
 		break;
 	case DRM_XE_VM_BIND_OP_UNMAP:
+		vops->flags |= XE_VMA_OPS_FLAG_MODIFIES_GPUVA;
 		ops = drm_gpuvm_sm_unmap_ops_create(&vm->gpuvm, addr, range);
 		break;
 	case DRM_XE_VM_BIND_OP_PREFETCH:
@@ -2377,6 +2387,7 @@ vm_bind_ioctl_ops_create(struct xe_vm *vm, struct xe_vma_ops *vops,
 	case DRM_XE_VM_BIND_OP_UNMAP_ALL:
 		xe_assert(vm->xe, bo);
 
+		vops->flags |= XE_VMA_OPS_FLAG_MODIFIES_GPUVA;
 		err = xe_bo_lock(bo, true);
 		if (err)
 			return ERR_PTR(err);
@@ -2584,10 +2595,12 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 {
 	int err = 0;
 
-	xe_vm_assert_write_mode_or_garbage_collector(vm);
+	lockdep_assert_held(&vm->lock);
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
+		xe_vm_assert_write_mode_or_garbage_collector(vm);
+
 		err |= xe_vm_insert_vma(vm, op->map.vma);
 		if (!err)
 			op->flags |= XE_VMA_OP_COMMITTED;
@@ -2597,6 +2610,8 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 		u8 tile_present =
 			gpuva_to_vma(op->base.remap.unmap->va)->tile_present;
 
+		xe_vm_assert_write_mode_or_garbage_collector(vm);
+
 		prep_vma_destroy(vm, gpuva_to_vma(op->base.remap.unmap->va),
 				 true);
 		op->flags |= XE_VMA_OP_COMMITTED;
@@ -2630,6 +2645,8 @@ static int xe_vma_op_commit(struct xe_vm *vm, struct xe_vma_op *op)
 		break;
 	}
 	case DRM_GPUVA_OP_UNMAP:
+		xe_vm_assert_write_mode_or_garbage_collector(vm);
+
 		prep_vma_destroy(vm, gpuva_to_vma(op->base.unmap.va), true);
 		op->flags |= XE_VMA_OP_COMMITTED;
 		break;
@@ -2831,10 +2848,12 @@ static void xe_vma_op_unwind(struct xe_vm *vm, struct xe_vma_op *op,
 			     bool post_commit, bool prev_post_commit,
 			     bool next_post_commit)
 {
-	xe_vm_assert_write_mode_or_garbage_collector(vm);
+	lockdep_assert_held(&vm->lock);
 
 	switch (op->base.op) {
 	case DRM_GPUVA_OP_MAP:
+		xe_vm_assert_write_mode_or_garbage_collector(vm);
+
 		if (op->map.vma) {
 			prep_vma_destroy(vm, op->map.vma, post_commit);
 			xe_vma_destroy_unlocked(op->map.vma);
@@ -2844,6 +2863,8 @@ static void xe_vma_op_unwind(struct xe_vm *vm, struct xe_vma_op *op,
 	{
 		struct xe_vma *vma = gpuva_to_vma(op->base.unmap.va);
 
+		xe_vm_assert_write_mode_or_garbage_collector(vm);
+
 		if (vma) {
 			down_read(&vm->userptr.notifier_lock);
 			vma->gpuva.flags &= ~XE_VMA_DESTROYED;
@@ -2857,6 +2878,8 @@ static void xe_vma_op_unwind(struct xe_vm *vm, struct xe_vma_op *op,
 	{
 		struct xe_vma *vma = gpuva_to_vma(op->base.remap.unmap->va);
 
+		xe_vm_assert_write_mode_or_garbage_collector(vm);
+
 		if (op->remap.prev) {
 			prep_vma_destroy(vm, op->remap.prev, prev_post_commit);
 			xe_vma_destroy_unlocked(op->remap.prev);
@@ -3313,7 +3336,7 @@ static struct dma_fence *vm_bind_ioctl_ops_execute(struct xe_vm *vm,
 	struct dma_fence *fence;
 	int err;
 
-	lockdep_assert_held_write(&vm->lock);
+	lockdep_assert_held(&vm->lock);
 
 	drm_exec_init(&exec, DRM_EXEC_INTERRUPTIBLE_WAIT |
 		      DRM_EXEC_IGNORE_DUPLICATES, 0);
@@ -3587,7 +3610,7 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	u32 num_syncs, num_ufence = 0;
 	struct xe_sync_entry *syncs = NULL;
 	struct drm_xe_vm_bind_op *bind_ops;
-	struct xe_vma_ops vops;
+	struct xe_vma_ops vops = { .flags = 0 };
 	struct dma_fence *fence;
 	int err;
 	int i;
@@ -3753,6 +3776,11 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		goto unwind_ops;
 	}
 
+	if (!(vops.flags & XE_VMA_OPS_FLAG_MODIFIES_GPUVA)) {
+		vops.flags |= XE_VMA_OPS_FLAG_DOWNGRADE_LOCK;
+		downgrade_write(&vm->lock);
+	}
+
 	err = xe_vma_ops_alloc(&vops, args->num_binds > 1);
 	if (err)
 		goto unwind_ops;
@@ -3785,7 +3813,10 @@ int xe_vm_bind_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	for (i = 0; i < args->num_binds; ++i)
 		xe_bo_put(bos[i]);
 release_vm_lock:
-	up_write(&vm->lock);
+	if (vops.flags & XE_VMA_OPS_FLAG_DOWNGRADE_LOCK)
+		up_read(&vm->lock);
+	else
+		up_write(&vm->lock);
 put_exec_queue:
 	if (q)
 		xe_exec_queue_put(q);
diff --git a/drivers/gpu/drm/xe/xe_vm_types.h b/drivers/gpu/drm/xe/xe_vm_types.h
index 1aabdedbfa92..332822e6ee7f 100644
--- a/drivers/gpu/drm/xe/xe_vm_types.h
+++ b/drivers/gpu/drm/xe/xe_vm_types.h
@@ -481,7 +481,9 @@ struct xe_vma_ops {
 	/** @pt_update_ops: page table update operations */
 	struct xe_vm_pgtable_update_ops pt_update_ops[XE_MAX_TILES_PER_DEVICE];
 	/** @flag: signify the properties within xe_vma_ops*/
-#define XE_VMA_OPS_FLAG_HAS_SVM_PREFETCH BIT(0)
+#define XE_VMA_OPS_FLAG_HAS_SVM_PREFETCH	BIT(0)
+#define XE_VMA_OPS_FLAG_MODIFIES_GPUVA		BIT(1)
+#define XE_VMA_OPS_FLAG_DOWNGRADE_LOCK		BIT(2)
 	u32 flags;
 #ifdef TEST_VM_OPS_ERROR
 	/** @inject_error: inject error to test error handling */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 10/11] drm/xe: Thread prefetch of SVM ranges
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
                   ` (8 preceding siblings ...)
  2025-08-06  6:22 ` [PATCH 09/11] drm/xe: Allow prefetch-only VM bind IOCTLs to use VM read lock Matthew Brost
@ 2025-08-06  6:22 ` Matthew Brost
  2025-08-28 22:55   ` Summers, Stuart
  2025-08-06  6:22 ` [PATCH 11/11] drm/xe: Add num_pf_queue modparam Matthew Brost
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

The migrate_vma_* functions are very CPU-intensive; as a result,
prefetching SVM ranges is limited by CPU performance rather than paging
copy engine bandwidth. To accelerate SVM range prefetching, the step
that calls migrate_vma_* is now threaded. Reuses the page fault work
queue for threading.

Running xe_exec_system_allocator --r prefetch-benchmark, which tests
64MB prefetches, shows an increase from ~4.35 GB/s to 12.25 GB/s with
this patch on drm-tip. Enabling high SLPC further increases throughput
to ~15.25 GB/s, and combining SLPC with ULLS raises it to ~16 GB/s. Both
of these optimizations are upcoming.

v2:
 - Use dedicated prefetch workqueue
 - Pick dedicated prefetch thread count based on profiling
 - Skip threaded prefetch for only 1 range or if prefetching to SRAM
 - Fully tested
v3:
 - Use page fault work queue

Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
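For reference, the fan-out/flush shape of the threaded prefetch in isolation.
struct prefetch_thread and prefetch_work_func() are introduced by this patch;
fill_prefetch_args_sketch() is a placeholder for filling in the per-range
ctx/vma/tile/region arguments. Work items are flushed individually rather
than flushing the whole pf_wq, since the queue is shared with the page fault
and garbage collector workers.

static int prefetch_fan_out_sketch(struct xe_vm *vm, struct xe_vma_op *op,
				   unsigned long count)
{
	struct prefetch_thread *jobs;
	struct xe_svm_range *svm_range;
	unsigned long i, idx = 0;
	int err = 0;

	jobs = kvmalloc_array(count, sizeof(*jobs), GFP_KERNEL);
	if (!jobs)
		return -ENOMEM;

	/* Producer: one work item per SVM range, all on the shared pf_wq. */
	xa_for_each(&op->prefetch_range.range, i, svm_range) {
		struct prefetch_thread *job = jobs + idx++;

		INIT_WORK(&job->work, prefetch_work_func);
		fill_prefetch_args_sketch(job, svm_range);	/* placeholder */
		queue_work(vm->xe->usm.pf_wq, &job->work);
	}

	/* Wait for all items, keeping the first meaningful error. */
	for (i = 0; i < idx; ++i) {
		flush_work(&jobs[i].work);
		if (jobs[i].err && (!err || err == -ENODATA))
			err = jobs[i].err;
	}

	kvfree(jobs);
	return err;
}
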
 drivers/gpu/drm/xe/xe_pagefault.c |  30 ++++++-
 drivers/gpu/drm/xe/xe_svm.c       |  17 +++-
 drivers/gpu/drm/xe/xe_vm.c        | 144 +++++++++++++++++++++++-------
 3 files changed, 152 insertions(+), 39 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
index 95d2eb8566fb..f11c70ca6dd9 100644
--- a/drivers/gpu/drm/xe/xe_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_pagefault.c
@@ -177,7 +177,17 @@ static int xe_pagefault_service(struct xe_pagefault *pf)
 	if (IS_ERR(vm))
 		return PTR_ERR(vm);
 
-	down_read(&vm->lock);
+	/*
+	 * We can't block threaded prefetches from completing. down_read() can
+	 * block on a pending down_write(), so without a trylock here, we could
+	 * deadlock, since the page fault workqueue is shared with prefetches,
+	 * prefetches flush work items onto the same workqueue, and a
+	 * down_write() could be pending.
+	 */
+	if (!down_read_trylock(&vm->lock)) {
+		err = -EAGAIN;
+		goto put_vm;
+	}
 
 	if (xe_vm_is_closed(vm)) {
 		err = -ENOENT;
@@ -202,11 +212,23 @@ static int xe_pagefault_service(struct xe_pagefault *pf)
 	if (!err)
 		vm->usm.last_fault_vma = vma;
 	up_read(&vm->lock);
+put_vm:
 	xe_vm_put(vm);
 
 	return err;
 }
 
+static void xe_pagefault_queue_retry(struct xe_pagefault_queue *pf_queue,
+				     struct xe_pagefault *pf)
+{
+	spin_lock_irq(&pf_queue->lock);
+	if (!pf_queue->tail)
+		pf_queue->tail = pf_queue->size - xe_pagefault_entry_size();
+	else
+		pf_queue->tail -= xe_pagefault_entry_size();
+	spin_unlock_irq(&pf_queue->lock);
+}
+
 static bool xe_pagefault_queue_pop(struct xe_pagefault_queue *pf_queue,
 				   struct xe_pagefault *pf)
 {
@@ -259,7 +281,11 @@ static void xe_pagefault_queue_work(struct work_struct *w)
 			continue;
 
 		err = xe_pagefault_service(&pf);
-		if (err) {
+		if (err == -EAGAIN) {
+			xe_pagefault_queue_retry(pf_queue, &pf);
+			queue_work(gt_to_xe(pf.gt)->usm.pf_wq, w);
+			break;
+		} else if (err) {
 			xe_pagefault_print(&pf);
 			xe_gt_dbg(pf.gt, "Fault response: Unsuccessful %pe\n",
 				  ERR_PTR(err));
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 6e5d9ce7c76e..069ede2c7991 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -306,8 +306,19 @@ static void xe_svm_garbage_collector_work_func(struct work_struct *w)
 	struct xe_vm *vm = container_of(w, struct xe_vm,
 					svm.garbage_collector.work);
 
-	guard(rwsem_read)(&vm->lock);
-	xe_svm_garbage_collector(vm);
+	/*
+	 * We can't block threaded prefetches from completing. down_read() can
+	 * block on a pending down_write(), so without a trylock here, we could
+	 * deadlock, since the page fault workqueue is shared with prefetches,
+	 * prefetches flush work items onto the same workqueue, and a
+	 * down_write() could be pending.
+	 */
+	if (down_read_trylock(&vm->lock)) {
+		xe_svm_garbage_collector(vm);
+		up_read(&vm->lock);
+	} else {
+		queue_work(vm->xe->usm.pf_wq, &vm->svm.garbage_collector.work);
+	}
 }
 
 #if IS_ENABLED(CONFIG_DRM_XE_PAGEMAP)
@@ -1148,5 +1159,5 @@ int xe_devm_add(struct xe_tile *tile, struct xe_vram_region *vr)
 void xe_svm_flush(struct xe_vm *vm)
 {
 	if (xe_vm_in_fault_mode(vm))
-		flush_work(&vm->svm.garbage_collector.work);
+		__flush_workqueue(vm->xe->usm.pf_wq);
 }
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 3211827ef6d7..147b900b1f0b 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -2962,57 +2962,132 @@ static int check_ufence(struct xe_vma *vma)
 	return 0;
 }
 
-static int prefetch_ranges(struct xe_vm *vm, struct xe_vma_op *op)
+struct prefetch_thread {
+	struct work_struct work;
+	struct drm_gpusvm_ctx *ctx;
+	struct xe_vma *vma;
+	struct xe_svm_range *svm_range;
+	struct xe_tile *tile;
+	u32 region;
+	int err;
+};
+
+static void prefetch_thread_func(struct prefetch_thread *thread)
 {
-	bool devmem_possible = IS_DGFX(vm->xe) && IS_ENABLED(CONFIG_DRM_XE_PAGEMAP);
-	struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+	struct xe_vma *vma = thread->vma;
+	struct xe_vm *vm = xe_vma_vm(vma);
+	struct xe_svm_range *svm_range = thread->svm_range;
+	u32 region = thread->region;
+	struct xe_tile *tile = thread->tile;
 	int err = 0;
 
-	struct xe_svm_range *svm_range;
+	guard(mutex)(&svm_range->lock);
+
+	if (xe_svm_range_is_removed(svm_range)) {
+		thread->err = -ENODATA;
+		return;
+	}
+
+	if (!region) {
+		xe_svm_range_migrate_to_smem(vm, svm_range);
+	} else if (xe_svm_range_needs_migrate_to_vram(svm_range, vma, region)) {
+		err = xe_svm_alloc_vram(tile, svm_range, thread->ctx);
+		if (err) {
+			drm_dbg(&vm->xe->drm,
+				"VRAM allocation failed, retry from userspace, asid=%u, gpusvm=%p, errno=%pe\n",
+				vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err));
+			thread->err = -ENODATA;
+			return;
+		}
+		xe_svm_range_debug(svm_range, "PREFETCH - RANGE MIGRATED TO VRAM");
+	}
+
+	err = xe_svm_range_get_pages(vm, svm_range, thread->ctx);
+	if (err) {
+		drm_dbg(&vm->xe->drm, "Get pages failed, asid=%u, gpusvm=%p, errno=%pe\n",
+			vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err));
+		if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM)
+			err = -ENODATA;
+		thread->err = err;
+		return;
+	}
+
+	xe_svm_range_debug(svm_range, "PREFETCH - RANGE GET PAGES DONE");
+}
+
+static void prefetch_work_func(struct work_struct *w)
+{
+	struct prefetch_thread *thread =
+		container_of(w, struct prefetch_thread, work);
+
+	prefetch_thread_func(thread);
+}
+
+static int prefetch_ranges(struct xe_vm *vm, struct xe_vma_ops *vops,
+			   struct xe_vma_op *op)
+{
+	struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
+	u32 region = op->prefetch_range.region;
 	struct drm_gpusvm_ctx ctx = {};
-	struct xe_tile *tile;
+	struct prefetch_thread stack_thread;
+	struct xe_svm_range *svm_range;
+	struct prefetch_thread *prefetches;
+	bool sram = region_to_mem_type[region] == XE_PL_TT;
+	struct xe_tile *tile = sram ? xe_device_get_root_tile(vm->xe) :
+		&vm->xe->tiles[region_to_mem_type[region] - XE_PL_VRAM0];
 	unsigned long i;
-	u32 region;
+	bool devmem_possible = IS_DGFX(vm->xe) &&
+		IS_ENABLED(CONFIG_DRM_XE_PAGEMAP);
+	bool skip_threads = op->prefetch_range.ranges_count == 1 || sram ||
+		!(vops->flags & XE_VMA_OPS_FLAG_DOWNGRADE_LOCK);
+	struct prefetch_thread *thread = skip_threads ? &stack_thread : NULL;
+	int err = 0, idx = 0;
 
 	if (!xe_vma_is_cpu_addr_mirror(vma))
 		return 0;
 
-	region = op->prefetch_range.region;
+	if (!skip_threads) {
+		prefetches = kvmalloc_array(op->prefetch_range.ranges_count,
+					    sizeof(*prefetches), GFP_KERNEL);
+		if (!prefetches)
+			return -ENOMEM;
+	}
 
 	ctx.read_only = xe_vma_read_only(vma);
 	ctx.devmem_possible = devmem_possible;
 	ctx.check_pages_threshold = devmem_possible ? SZ_64K : 0;
 
-	/* TODO: Threading the migration */
 	xa_for_each(&op->prefetch_range.range, i, svm_range) {
-		guard(mutex)(&svm_range->lock);
-
-		if (xe_svm_range_is_removed(svm_range))
-			return -ENODATA;
-
-		if (!region)
-			xe_svm_range_migrate_to_smem(vm, svm_range);
+		if (!skip_threads) {
+			thread = prefetches + idx++;
+			INIT_WORK(&thread->work, prefetch_work_func);
+		}
 
-		if (xe_svm_range_needs_migrate_to_vram(svm_range, vma, region)) {
-			tile = &vm->xe->tiles[region_to_mem_type[region] - XE_PL_VRAM0];
-			err = xe_svm_alloc_vram(tile, svm_range, &ctx);
-			if (err) {
-				drm_dbg(&vm->xe->drm, "VRAM allocation failed, retry from userspace, asid=%u, gpusvm=%p, errno=%pe\n",
-					vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err));
-				return -ENODATA;
-			}
-			xe_svm_range_debug(svm_range, "PREFETCH - RANGE MIGRATED TO VRAM");
+		thread->ctx = &ctx;
+		thread->vma = vma;
+		thread->svm_range = svm_range;
+		thread->tile = tile;
+		thread->region = region;
+		thread->err = 0;
+
+		if (skip_threads) {
+			prefetch_thread_func(thread);
+			if (thread->err)
+				return thread->err;
+		} else {
+			queue_work(vm->xe->usm.pf_wq, &thread->work);
 		}
+	}
 
-		err = xe_svm_range_get_pages(vm, svm_range, &ctx);
-		if (err) {
-			drm_dbg(&vm->xe->drm, "Get pages failed, asid=%u, gpusvm=%p, errno=%pe\n",
-				vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err));
-			if (err == -EOPNOTSUPP || err == -EFAULT || err == -EPERM)
-				err = -ENODATA;
-			return err;
+	if (!skip_threads) {
+		for (i = 0; i < idx; ++i) {
+			thread = prefetches + i;
+
+			flush_work(&thread->work);
+			if (thread->err && (!err || err == -ENODATA))
+				err = thread->err;
 		}
-		xe_svm_range_debug(svm_range, "PREFETCH - RANGE GET PAGES DONE");
+		kvfree(prefetches);
 	}
 
 	return err;
@@ -3079,7 +3154,8 @@ static int op_lock_and_prep(struct drm_exec *exec, struct xe_vm *vm,
 	return err;
 }
 
-static int vm_bind_ioctl_ops_prefetch_ranges(struct xe_vm *vm, struct xe_vma_ops *vops)
+static int vm_bind_ioctl_ops_prefetch_ranges(struct xe_vm *vm,
+					     struct xe_vma_ops *vops)
 {
 	struct xe_vma_op *op;
 	int err;
@@ -3089,7 +3165,7 @@ static int vm_bind_ioctl_ops_prefetch_ranges(struct xe_vm *vm, struct xe_vma_ops
 
 	list_for_each_entry(op, &vops->list, link) {
 		if (op->base.op  == DRM_GPUVA_OP_PREFETCH) {
-			err = prefetch_ranges(vm, op);
+			err = prefetch_ranges(vm, vops, op);
 			if (err)
 				return err;
 		}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 11/11] drm/xe: Add num_pf_queue modparam
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
                   ` (9 preceding siblings ...)
  2025-08-06  6:22 ` [PATCH 10/11] drm/xe: Thread prefetch of SVM ranges Matthew Brost
@ 2025-08-06  6:22 ` Matthew Brost
  2025-08-28 22:58   ` Summers, Stuart
  2025-08-06  6:36 ` ✗ CI.checkpatch: warning for Pagefault refactor, fine grained fault locking, threaded prefetch Patchwork
  2025-08-06  6:36 ` ✗ CI.KUnit: failure " Patchwork
  12 siblings, 1 reply; 51+ messages in thread
From: Matthew Brost @ 2025-08-06  6:22 UTC (permalink / raw)
  To: intel-xe
  Cc: thomas.hellstrom, himal.prasad.ghimiray, francois.dugast,
	michal.mrozek

Enable quick experiment to see how number of page fault queues affects
performance.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
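For reference, how this would typically be exercised (illustrative commands
only). The value is clamped to [1, 8] and copied into xe->info at device
creation, so runtime writes to the parameter file only take effect for
devices probed afterwards:

  # pick the page fault queue count at module load
  modprobe xe num_pf_queue=8

  # or adjust the (0600) parameter before a later probe/rebind
  echo 2 > /sys/module/xe/parameters/num_pf_queue
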
 drivers/gpu/drm/xe/xe_device.c       | 15 +++++++++++++--
 drivers/gpu/drm/xe/xe_device_types.h |  6 ++++--
 drivers/gpu/drm/xe/xe_module.c       |  5 +++++
 drivers/gpu/drm/xe/xe_module.h       |  1 +
 drivers/gpu/drm/xe/xe_pagefault.c    |  8 ++++----
 drivers/gpu/drm/xe/xe_vm.c           |  3 ++-
 6 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index c7c8aee03841..47eb07e9c799 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -413,6 +413,17 @@ static void xe_device_destroy(struct drm_device *dev, void *dummy)
 	ttm_device_fini(&xe->ttm);
 }
 
+static void xe_device_parse_modparam(struct xe_device *xe)
+{
+	xe->info.force_execlist = xe_modparam.force_execlist;
+	xe->info.num_pf_queue = xe_modparam.num_pf_queue;
+	if (xe->info.num_pf_queue < 1)
+		xe->info.num_pf_queue = 1;
+	else if (xe->info.num_pf_queue > XE_PAGEFAULT_QUEUE_MAX)
+		xe->info.num_pf_queue = XE_PAGEFAULT_QUEUE_MAX;
+	xe->atomic_svm_timeslice_ms = 5;
+}
+
 struct xe_device *xe_device_create(struct pci_dev *pdev,
 				   const struct pci_device_id *ent)
 {
@@ -446,8 +457,8 @@ struct xe_device *xe_device_create(struct pci_dev *pdev,
 
 	xe->info.devid = pdev->device;
 	xe->info.revid = pdev->revision;
-	xe->info.force_execlist = xe_modparam.force_execlist;
-	xe->atomic_svm_timeslice_ms = 5;
+
+	xe_device_parse_modparam(xe);
 
 	err = xe_irq_init(xe);
 	if (err)
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 02b91a698500..d5c5fd7972a1 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -243,6 +243,8 @@ struct xe_device {
 		u8 revid;
 		/** @info.step: stepping information for each IP */
 		struct xe_step_info step;
+		/** @info.num_pf_queue: Number of page fault queues */
+		int num_pf_queue;
 		/** @info.dma_mask_size: DMA address bits */
 		u8 dma_mask_size;
 		/** @info.vram_flags: Vram flags */
@@ -399,9 +401,9 @@ struct xe_device {
 		struct rw_semaphore lock;
 		/** @usm.pf_wq: page fault work queue, unbound, high priority */
 		struct workqueue_struct *pf_wq;
-#define XE_PAGEFAULT_QUEUE_COUNT	4
+#define XE_PAGEFAULT_QUEUE_MAX	8
 		/** @pf_queue: Page fault queues */
-		struct xe_pagefault_queue pf_queue[XE_PAGEFAULT_QUEUE_COUNT];
+		struct xe_pagefault_queue pf_queue[XE_PAGEFAULT_QUEUE_MAX];
 	} usm;
 
 	/** @pinned: pinned BO state */
diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
index d08338fc3bc1..0671ae9d9e5a 100644
--- a/drivers/gpu/drm/xe/xe_module.c
+++ b/drivers/gpu/drm/xe/xe_module.c
@@ -27,6 +27,7 @@
 #define DEFAULT_PROBE_DISPLAY		true
 #define DEFAULT_VRAM_BAR_SIZE		0
 #define DEFAULT_FORCE_PROBE		CONFIG_DRM_XE_FORCE_PROBE
+#define DEFAULT_NUM_PF_QUEUE		4
 #define DEFAULT_MAX_VFS			~0
 #define DEFAULT_MAX_VFS_STR		"unlimited"
 #define DEFAULT_WEDGED_MODE		1
@@ -40,6 +41,7 @@ struct xe_modparam xe_modparam = {
 	.max_vfs =		DEFAULT_MAX_VFS,
 #endif
 	.wedged_mode =		DEFAULT_WEDGED_MODE,
+	.num_pf_queue =		DEFAULT_NUM_PF_QUEUE,
 	.svm_notifier_size =	DEFAULT_SVM_NOTIFIER_SIZE,
 	/* the rest are 0 by default */
 };
@@ -93,6 +95,9 @@ MODULE_PARM_DESC(wedged_mode,
 		 "Module's default policy for the wedged mode (0=never, 1=upon-critical-errors, 2=upon-any-hang "
 		 "[default=" __stringify(DEFAULT_WEDGED_MODE) "])");
 
+module_param_named(num_pf_queue, xe_modparam.num_pf_queue, int, 0600);
+MODULE_PARM_DESC(num_pf_queue, "Number of page fault queues, default=4, min=1, max=8");
+
 static int xe_check_nomodeset(void)
 {
 	if (drm_firmware_drivers_only())
diff --git a/drivers/gpu/drm/xe/xe_module.h b/drivers/gpu/drm/xe/xe_module.h
index 5a3bfea8b7b4..36ac2151fe16 100644
--- a/drivers/gpu/drm/xe/xe_module.h
+++ b/drivers/gpu/drm/xe/xe_module.h
@@ -22,6 +22,7 @@ struct xe_modparam {
 	unsigned int max_vfs;
 #endif
 	int wedged_mode;
+	int num_pf_queue;
 	u32 svm_notifier_size;
 };
 
diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
index f11c70ca6dd9..3c69557c6aa9 100644
--- a/drivers/gpu/drm/xe/xe_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_pagefault.c
@@ -373,11 +373,11 @@ int xe_pagefault_init(struct xe_device *xe)
 
 	xe->usm.pf_wq = alloc_workqueue("xe_page_fault_work_queue",
 					WQ_UNBOUND | WQ_HIGHPRI,
-					XE_PAGEFAULT_QUEUE_COUNT);
+					xe->info.num_pf_queue);
 	if (!xe->usm.pf_wq)
 		return -ENOMEM;
 
-	for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i) {
+	for (i = 0; i < xe->info.num_pf_queue; ++i) {
 		err = xe_pagefault_queue_init(xe, xe->usm.pf_queue + i);
 		if (err)
 			goto err_out;
@@ -420,7 +420,7 @@ void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
 {
 	int i;
 
-	for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i)
+	for (i = 0; i < xe->info.num_pf_queue; ++i)
 		xe_pagefault_queue_reset(xe, gt, xe->usm.pf_queue + i);
 }
 
@@ -442,7 +442,7 @@ static int xe_pagefault_queue_index(struct xe_device *xe)
 
 	WRITE_ONCE(xe->usm.current_pf_queue, (old_pf_queue + 1));
 
-	return old_pf_queue % XE_PAGEFAULT_QUEUE_COUNT;
+	return old_pf_queue % xe->info.num_pf_queue;
 }
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 147b900b1f0b..67000c4466ab 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -3039,7 +3039,8 @@ static int prefetch_ranges(struct xe_vm *vm, struct xe_vma_ops *vops,
 	bool devmem_possible = IS_DGFX(vm->xe) &&
 		IS_ENABLED(CONFIG_DRM_XE_PAGEMAP);
 	bool skip_threads = op->prefetch_range.ranges_count == 1 || sram ||
-		!(vops->flags & XE_VMA_OPS_FLAG_DOWNGRADE_LOCK);
+		!(vops->flags & XE_VMA_OPS_FLAG_DOWNGRADE_LOCK) ||
+		vm->xe->info.num_pf_queue == 1;
 	struct prefetch_thread *thread = skip_threads ? &stack_thread : NULL;
 	int err = 0, idx = 0;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* ✗ CI.checkpatch: warning for Pagefault refactor, fine grained fault locking, threaded prefetch
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
                   ` (10 preceding siblings ...)
  2025-08-06  6:22 ` [PATCH 11/11] drm/xe: Add num_pf_queue modparam Matthew Brost
@ 2025-08-06  6:36 ` Patchwork
  2025-08-06  6:36 ` ✗ CI.KUnit: failure " Patchwork
  12 siblings, 0 replies; 51+ messages in thread
From: Patchwork @ 2025-08-06  6:36 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: Pagefault refactor, fine grained fault locking, threaded prefetch
URL   : https://patchwork.freedesktop.org/series/152565/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
c298eac5978c38dcc62a70c0d73c91765e7cc296
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 7ee7c0e5458c479b4896cf904e810b14b1953536
Author: Matthew Brost <matthew.brost@intel.com>
Date:   Tue Aug 5 23:22:42 2025 -0700

    drm/xe: Add num_pf_queue modparam
    
    Enable quick experiment to see how number of page fault queues affects
    performance.
    
    Signed-off-by: Matthew Brost <matthew.brost@intel.com>
+ /mt/dim checkpatch d636513f71664d86e9205b92a45bd8b21ea67a2b drm-intel
f491b29a011e drm/xe: Stub out new pagefault layer
-:25: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#25: 
new file mode 100644

-:223: WARNING:TYPO_SPELLING: 'acknowledgement' may be misspelled - perhaps 'acknowledgment'?
#223: FILE: drivers/gpu/drm/xe/xe_pagefault_types.h:100:
+		 * acknowledgement to formulate response to HW/FW interface.
 		   ^^^^^^^^^^^^^^^

total: 0 errors, 2 warnings, 0 checks, 214 lines checked
a506cb47dffa drm/xe: Implement xe_pagefault_init
5e7c9be76dc1 drm/xe: Implement xe_pagefault_reset
0789190d741c drm/xe: Implement xe_pagefault_handler
99c1fae0ffc3 drm/xe: Implement xe_pagefault_queue_work
90f57ccf361e drm/xe: Add xe_guc_pagefault layer
-:97: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#97: 
new file mode 100644

-:179: WARNING:ONE_SEMICOLON: Statements terminations use 1 semicolon
#179: FILE: drivers/gpu/drm/xe/xe_guc_pagefault.c:78:
+	pf.consumer.access_type = FIELD_GET(PFD_ACCESS_TYPE, msg[2]);;

-:240: ERROR:MISSING_SIGN_OFF: Missing Signed-off-by: line(s)

total: 1 errors, 2 warnings, 0 checks, 185 lines checked
a9748a3e20af drm/xe: Remove unused GT page fault code
-:13: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does MAINTAINERS need updating?
#13: 
deleted file mode 100644

total: 0 errors, 1 warnings, 0 checks, 71 lines checked
df42d2ed34f0 drm/xe: Fine grained page fault locking
-:752: WARNING:FROM_SIGN_OFF_MISMATCH: From:/Signed-off-by: email address mismatch: 'From: Matthew Brost <matthew.brost@intel.com>' != 'Signed-off-by: Matthew Brost <atthew.brost@intel.com>'

total: 0 errors, 1 warnings, 0 checks, 644 lines checked
b90179a168fb drm/xe: Allow prefetch-only VM bind IOCTLs to use VM read lock
78b8cb96d86e drm/xe: Thread prefetch of SVM ranges
7ee7c0e5458c drm/xe: Add num_pf_queue modparam



^ permalink raw reply	[flat|nested] 51+ messages in thread

* ✗ CI.KUnit: failure for Pagefault refactor, fine grained fault locking, threaded prefetch
  2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
                   ` (11 preceding siblings ...)
  2025-08-06  6:36 ` ✗ CI.checkpatch: warning for Pagefault refactor, fine grained fault locking, threaded prefetch Patchwork
@ 2025-08-06  6:36 ` Patchwork
  12 siblings, 0 replies; 51+ messages in thread
From: Patchwork @ 2025-08-06  6:36 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe

== Series Details ==

Series: Pagefault refactor, fine grained fault locking, threaded prefetch
URL   : https://patchwork.freedesktop.org/series/152565/
State : failure

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
ERROR:root:../drivers/gpu/drm/xe/xe_vm.c: In function ‘prefetch_thread_func’:
../drivers/gpu/drm/xe/xe_vm.c:2984:32: error: ‘struct xe_svm_range’ has no member named ‘lock’
 2984 |         guard(mutex)(&svm_range->lock);
      |                                ^~
../drivers/gpu/drm/xe/xe_vm.c:2986:13: error: implicit declaration of function ‘xe_svm_range_is_removed’; did you mean ‘xe_svm_range_size’? [-Werror=implicit-function-declaration]
 2986 |         if (xe_svm_range_is_removed(svm_range)) {
      |             ^~~~~~~~~~~~~~~~~~~~~~~
      |             xe_svm_range_size
cc1: some warnings being treated as errors
make[7]: *** [../scripts/Makefile.build:287: drivers/gpu/drm/xe/xe_vm.o] Error 1
make[7]: *** Waiting for unfinished jobs....
make[6]: *** [../scripts/Makefile.build:555: drivers/gpu/drm/xe] Error 2
make[5]: *** [../scripts/Makefile.build:555: drivers/gpu/drm] Error 2
make[4]: *** [../scripts/Makefile.build:555: drivers/gpu] Error 2
make[3]: *** [../scripts/Makefile.build:555: drivers] Error 2
make[2]: *** [/kernel/Makefile:2003: .] Error 2
make[1]: *** [/kernel/Makefile:248: __sub-make] Error 2
make: *** [Makefile:248: __sub-make] Error 2

[06:36:19] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[06:36:23] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-06  6:22 ` [PATCH 01/11] drm/xe: Stub out new pagefault layer Matthew Brost
@ 2025-08-06 23:01   ` Summers, Stuart
  2025-08-06 23:53     ` Matthew Brost
  2025-08-27 15:29   ` Francois Dugast
  2025-08-28 20:08   ` Summers, Stuart
  2 siblings, 1 reply; 51+ messages in thread
From: Summers, Stuart @ 2025-08-06 23:01 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost,  Matthew
  Cc: Mrozek, Michal, Ghimiray, Himal Prasad,
	thomas.hellstrom@linux.intel.com, Dugast, Francois

Few basic comments below to start. I personally would rather this be
brought over from the existing fault handler rather than creating
something entirely new and then clobbering the older stuff - just so
the review of message format requests/replies is easier to review and
where we're deviating from the existing external interfaces
(HW/FW/GuC/etc). You already have this here though so not a huge deal.
I think most of this was in the giant blob of patches that got merged
with the initial driver, so I guess the counter argument is we can have
easy to reference historical reviews now.

On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> Stub out the new page fault layer and add kernel documentation. This
> is
> intended as a replacement for the GT page fault layer, enabling
> multiple
> producers to hook into a shared page fault consumer interface.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile             |   1 +
>  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
>  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
>  drivers/gpu/drm/xe/xe_pagefault_types.h | 125
> ++++++++++++++++++++++++
>  4 files changed, 208 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
>  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
>  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index 8e0c3412a757..6fbebafe79c9 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
>         xe_nvm.o \
>         xe_oa.o \
>         xe_observation.o \
> +       xe_pagefault.o \
>         xe_pat.o \
>         xe_pci.o \
>         xe_pcode.o \
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> b/drivers/gpu/drm/xe/xe_pagefault.c
> new file mode 100644
> index 000000000000..3ce0e8d74b9d
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -0,0 +1,63 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include "xe_pagefault.h"
> +#include "xe_pagefault_types.h"
> +
> +/**
> + * DOC: Xe page faults
> + *
> + * Xe page faults are handled in two layers. The producer layer
> interacts with
> + * hardware or firmware to receive and parse faults into struct
> xe_pagefault,
> + * then forwards them to the consumer. The consumer layer services
> the faults
> + * (e.g., memory migration, page table updates) and acknowledges the
> result back
> + * to the producer, which then forwards the results to the hardware
> or firmware.
> + * The consumer uses a page fault queue sized to absorb all
> potential faults and
> + * a multi-threaded worker to process them. Multiple producers are
> supported,
> + * with a single shared consumer.
> + */
> +
> +/**
> + * xe_pagefault_init() - Page fault init
> + * @xe: xe device instance
> + *
> + * Initialize Xe page fault state. Must be done after reading fuses.
> + *
> + * Return: 0 on Success, errno on failure
> + */
> +int xe_pagefault_init(struct xe_device *xe)
> +{
> +       /* TODO - implement */
> +       return 0;
> +}
> +
> +/**
> + * xe_pagefault_reset() - Page fault reset for a GT
> + * @xe: xe device instance
> + * @gt: GT being reset
> + *
> + * Reset the Xe page fault state for a GT; that is, squash any
> pending faults on
> + * the GT.
> + */
> +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> +{
> +       /* TODO - implement */
> +}
> +
> +/**
> + * xe_pagefault_handler() - Page fault handler
> + * @xe: xe device instance
> + * @pf: Page fault
> + *
> + * Sink the page fault to a queue (i.e., a memory buffer) and queue
> a worker to
> + * service it. Safe to be called from IRQ or process context.
> Reclaim safe.
> + *
> + * Return: 0 on success, errno on failure
> + */
> +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault
> *pf)
> +{
> +       /* TODO - implement */
> +       return 0;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.h
> b/drivers/gpu/drm/xe/xe_pagefault.h
> new file mode 100644
> index 000000000000..bd0cdf9ed37f
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef _XE_PAGEFAULT_H_
> +#define _XE_PAGEFAULT_H_
> +
> +struct xe_device;
> +struct xe_gt;
> +struct xe_pagefault;
> +
> +int xe_pagefault_init(struct xe_device *xe);
> +
> +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
> +
> +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault
> *pf);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h
> b/drivers/gpu/drm/xe/xe_pagefault_types.h
> new file mode 100644
> index 000000000000..fcff84f93dd8
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> @@ -0,0 +1,125 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef _XE_PAGEFAULT_TYPES_H_
> +#define _XE_PAGEFAULT_TYPES_H_
> +
> +#include <linux/workqueue.h>
> +
> +struct xe_pagefault;
> +struct xe_gt;

Nit: Maybe reverse these structs to be in alphabetical order

> +
> +/** enum xe_pagefault_access_type - Xe page fault access type */
> +enum xe_pagefault_access_type {
> +       /** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type */
> +       XE_PAGEFAULT_ACCESS_TYPE_READ   = 0,
> +       /** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access type */
> +       XE_PAGEFAULT_ACCESS_TYPE_WRITE  = 1,
> +       /** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access type */
> +       XE_PAGEFAULT_ACCESS_TYPE_ATOMIC = 2,
> +};
> +
> +/** enum xe_pagefault_type - Xe page fault type */
> +enum xe_pagefault_type {
> +       /** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> +       XE_PAGEFAULT_TYPE_NOT_PRESENT           = 0,
> +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write access
> violation */
> +       XE_PAGEFAULT_WRITE_ACCESS_VIOLATION     = 1,
> +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic access
> violation */

XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION ->
XE_PAGEFAULT_ACCESS_TYPE_ATOMIC

> +       XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION    = 2,
> +};
> +
> +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> +struct xe_pagefault_ops {
> +       /**
> +        * @ack_fault: Ack fault
> +        * @pf: Page fault
> +        * @err: Error state of fault
> +        *
> +        * Page fault producer receives acknowledgment from the
> consumer and
> +        * sends the result to the HW/FW interface.
> +        */
> +       void (*ack_fault)(struct xe_pagefault *pf, int err);
> +};
> +
> +/**
> + * struct xe_pagefault - Xe page fault
> + *
> + * Generic page fault structure for communication between producer
> and consumer.
> + * Carefully sized to be 64 bytes.
> + */
> +struct xe_pagefault {
> +       /**
> +        * @gt: GT of fault
> +        *
> +        * XXX: We may want to decouple the GT from individual
> faults, as it's
> +        * unclear whether future platforms will always have a GT for
> all page
> +        * fault producers. Internally, the GT is used for stats,
> identifying
> +        * the appropriate VRAM region, and locating the migration
> queue.
> +        * Leaving this as-is for now, but we can revisit later to
> see if we
> +        * can convert it to use the Xe device pointer instead.
> +        */

What if instead of assuming the GT stays static and we eventually
remove it if we have some new HW abstraction layer that isn't a GT but
still uses the page fault, we instead push to have said theoretical
abstraction layer overload the GT here like we're doing with primary
and media today. Then we can keep the interface here simple and just
leave this in there, or change in the future if that doesn't make sense
without the suggestive comment?

> +       struct xe_gt *gt;
> +       /**
> +        * @consumer: State for the software handling the fault.
> Populated by
> +        * the producer and may be modified by the consumer to
> communicate
> +        * information back to the producer upon fault
> acknowledgment.
> +        */
> +       struct {
> +               /** @consumer.page_addr: address of page fault */
> +               u64 page_addr;
> +               /** @consumer.asid: address space ID */
> +               u32 asid;

Can we just call this an ID instead of a pasid or asid? I.e. the ID
could be anything, not strictly process-bound.

> +               /** @consumer.access_type: access type */
> +               u8 access_type;
> +               /** @consumer.fault_type: fault type */
> +               u8 fault_type;
> +#define XE_PAGEFAULT_LEVEL_NACK                0xff    /* Producer
> indicates nack fault */
> +               /** @consumer.fault_level: fault level */
> +               u8 fault_level;
> +               /** @consumer.engine_class: engine class */
> +               u8 engine_class;
> +               /** consumer.reserved: reserved bits for future
> expansion */
> +               u64 reserved;

What about engine instance? Or is that going to overload reserved here?

Thanks,
Stuart

> +       } consumer;
> +       /**
> +        * @producer: State for the producer (i.e., HW/FW interface).
> Populated
> +        * by the producer and should not be modified—or even
> inspected—by the
> +        * consumer, except for calling operations.
> +        */
> +       struct {
> +               /** @producer.private: private pointer */
> +               void *private;
> +               /** @producer.ops: operations */
> +               const struct xe_pagefault_ops *ops;
> +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW       4
> +               /**
> +                * producer.msg: page fault message, used by producer
> in fault
> +                * acknowledgement to formulate response to HW/FW
> interface.
> +                */
> +               u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
> +       } producer;
> +};
> +
> +/** struct xe_pagefault_queue: Xe pagefault queue (consumer) */
> +struct xe_pagefault_queue {
> +       /**
> +        * @data: Data in queue containing struct xe_pagefault,
> protected by
> +        * @lock
> +        */
> +       void *data;
> +       /** @size: Size of queue in bytes */
> +       u32 size;
> +       /** @head: Head pointer in bytes, moved by producer,
> protected by @lock */
> +       u32 head;
> +       /** @tail: Tail pointer in bytes, moved by consumer,
> protected by @lock */
> +       u32 tail;
> +       /** @lock: protects page fault queue */
> +       spinlock_t lock;
> +       /** @worker: to process page faults */
> +       struct work_struct worker;
> +};
> +
> +#endif


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/11] drm/xe: Implement xe_pagefault_init
  2025-08-06  6:22 ` [PATCH 02/11] drm/xe: Implement xe_pagefault_init Matthew Brost
@ 2025-08-06 23:08   ` Summers, Stuart
  2025-08-06 23:59     ` Matthew Brost
  2025-08-27 16:30   ` Francois Dugast
  2025-08-28 20:10   ` Summers, Stuart
  2 siblings, 1 reply; 51+ messages in thread
From: Summers, Stuart @ 2025-08-06 23:08 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost,  Matthew
  Cc: Mrozek, Michal, Ghimiray, Himal Prasad,
	thomas.hellstrom@linux.intel.com, Dugast, Francois

On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> Create pagefault queues and initialize them.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c       |  5 ++
>  drivers/gpu/drm/xe/xe_device_types.h |  6 ++
>  drivers/gpu/drm/xe/xe_pagefault.c    | 93
> +++++++++++++++++++++++++++-
>  3 files changed, 102 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c
> b/drivers/gpu/drm/xe/xe_device.c
> index 57edbc63da6f..c7c8aee03841 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -50,6 +50,7 @@
>  #include "xe_nvm.h"
>  #include "xe_oa.h"
>  #include "xe_observation.h"
> +#include "xe_pagefault.h"
>  #include "xe_pat.h"
>  #include "xe_pcode.h"
>  #include "xe_pm.h"
> @@ -890,6 +891,10 @@ int xe_device_probe(struct xe_device *xe)
>         if (err)
>                 return err;
>  
> +       err = xe_pagefault_init(xe);
> +       if (err)
> +               return err;
> +
>         xe_nvm_init(xe);
>  
>         err = xe_heci_gsc_init(xe);
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> b/drivers/gpu/drm/xe/xe_device_types.h
> index 01e8fa0d2f9f..6aa119026ce9 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -17,6 +17,7 @@
>  #include "xe_lmtt_types.h"
>  #include "xe_memirq_types.h"
>  #include "xe_oa_types.h"
> +#include "xe_pagefault_types.h"
>  #include "xe_platform_types.h"
>  #include "xe_pmu_types.h"
>  #include "xe_pt_types.h"
> @@ -394,6 +395,11 @@ struct xe_device {
>                 u32 next_asid;
>                 /** @usm.lock: protects UM state */
>                 struct rw_semaphore lock;
> +               /** @usm.pf_wq: page fault work queue, unbound, high
> priority */
> +               struct workqueue_struct *pf_wq;
> +#define XE_PAGEFAULT_QUEUE_COUNT       4
> +               /** @pf_queue: Page fault queues */
> +               struct xe_pagefault_queue
> pf_queue[XE_PAGEFAULT_QUEUE_COUNT];
>         } usm;
>  
>         /** @pinned: pinned BO state */
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> b/drivers/gpu/drm/xe/xe_pagefault.c
> index 3ce0e8d74b9d..14304c41eb23 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -3,6 +3,10 @@
>   * Copyright © 2025 Intel Corporation
>   */
>  
> +#include <drm/drm_managed.h>
> +
> +#include "xe_device.h"
> +#include "xe_gt_types.h"
>  #include "xe_pagefault.h"
>  #include "xe_pagefault_types.h"
>  
> @@ -19,6 +23,71 @@
>   * with a single shared consumer.
>   */
>  
> +static int xe_pagefault_entry_size(void)
> +{
> +       return roundup_pow_of_two(sizeof(struct xe_pagefault));

Nice thanks!

> +}
> +
> +static void xe_pagefault_queue_work(struct work_struct *w)
> +{
> +       /* TODO: Implement */
> +}
> +
> +static int xe_pagefault_queue_init(struct xe_device *xe,
> +                                  struct xe_pagefault_queue
> *pf_queue)
> +{
> +       struct xe_gt *gt;
> +       int total_num_eus = 0;
> +       u8 id;
> +
> +       for_each_gt(gt, xe, id) {
> +               xe_dss_mask_t all_dss;
> +               int num_dss, num_eus;
> +
> +               bitmap_or(all_dss, gt->fuse_topo.g_dss_mask,
> +                         gt->fuse_topo.c_dss_mask,
> XE_MAX_DSS_FUSE_BITS);
> +
> +               num_dss = bitmap_weight(all_dss,
> XE_MAX_DSS_FUSE_BITS);
> +               num_eus = bitmap_weight(gt-
> >fuse_topo.eu_mask_per_dss,
> +                                       XE_MAX_EU_FUSE_BITS) *
> num_dss;
> +
> +               total_num_eus += num_eus;

I'm behind on that patch I had posted a while back to update this
algorithm :(. Want to pull that calculation in here directly so we can
remove the PF_MULTIPLIER you have below?

See https://patchwork.freedesktop.org/patch/651415/?series=148523&rev=1

I can also rework that on top of this if you'd prefer, either way is
fine with me.

Thanks,
Stuart

> +       }
> +
> +       xe_assert(xe, total_num_eus);
> +
> +       /*
> +        * user can issue separate page faults per EU and per CS
> +        *
> +        * XXX: Multiplier required as compute UMD are getting PF
> queue errors
> +        * without it. Follow on why this multiplier is required.
> +        */
> +#define PF_MULTIPLIER  8
> +       pf_queue->size = (total_num_eus + XE_NUM_HW_ENGINES) *
> +               xe_pagefault_entry_size() * PF_MULTIPLIER;
> +       pf_queue->size = roundup_pow_of_two(pf_queue->size);
> +#undef PF_MULTIPLIER
> +
> +       drm_dbg(&xe->drm, "xe_pagefault_entry_size=%d,
> total_num_eus=%d, pf_queue->size=%u",
> +               xe_pagefault_entry_size(), total_num_eus, pf_queue-
> >size);
> +
> +       pf_queue->data = devm_kzalloc(xe->drm.dev, pf_queue->size,
> GFP_KERNEL);
> +       if (!pf_queue->data)
> +               return -ENOMEM;
> +
> +       spin_lock_init(&pf_queue->lock);
> +       INIT_WORK(&pf_queue->worker, xe_pagefault_queue_work);
> +
> +       return 0;
> +}
> +
> +static void xe_pagefault_fini(void *arg)
> +{
> +       struct xe_device *xe = arg;
> +
> +       destroy_workqueue(xe->usm.pf_wq);
> +}
> +
>  /**
>   * xe_pagefault_init() - Page fault init
>   * @xe: xe device instance
> @@ -29,8 +98,28 @@
>   */
>  int xe_pagefault_init(struct xe_device *xe)
>  {
> -       /* TODO - implement */
> -       return 0;
> +       int err, i;
> +
> +       if (!xe->info.has_usm)
> +               return 0;
> +
> +       xe->usm.pf_wq = alloc_workqueue("xe_page_fault_work_queue",
> +                                       WQ_UNBOUND | WQ_HIGHPRI,
> +                                       XE_PAGEFAULT_QUEUE_COUNT);
> +       if (!xe->usm.pf_wq)
> +               return -ENOMEM;
> +
> +       for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i) {
> +               err = xe_pagefault_queue_init(xe, xe->usm.pf_queue +
> i);
> +               if (err)
> +                       goto err_out;
> +       }
> +
> +       return devm_add_action_or_reset(xe->drm.dev,
> xe_pagefault_fini, xe);
> +
> +err_out:
> +       destroy_workqueue(xe->usm.pf_wq);
> +       return err;
>  }
>  
>  /**


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 03/11] drm/xe: Implement xe_pagefault_reset
  2025-08-06  6:22 ` [PATCH 03/11] drm/xe: Implement xe_pagefault_reset Matthew Brost
@ 2025-08-06 23:16   ` Summers, Stuart
  2025-08-07  0:12     ` Matthew Brost
  0 siblings, 1 reply; 51+ messages in thread
From: Summers, Stuart @ 2025-08-06 23:16 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost,  Matthew
  Cc: Mrozek, Michal, Ghimiray, Himal Prasad,
	thomas.hellstrom@linux.intel.com, Dugast, Francois

On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> Squash any pending faults on the GT being reset by setting the GT
> field
> in struct xe_pagefault to NULL.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt.c        |  2 ++
>  drivers/gpu/drm/xe/xe_pagefault.c | 23 ++++++++++++++++++++++-
>  2 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index 390394bbaadc..5aa03f89a062 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -50,6 +50,7 @@
>  #include "xe_map.h"
>  #include "xe_migrate.h"
>  #include "xe_mmio.h"
> +#include "xe_pagefault.h"
>  #include "xe_pat.h"
>  #include "xe_pm.h"
>  #include "xe_mocs.h"
> @@ -846,6 +847,7 @@ static int gt_reset(struct xe_gt *gt)
>  
>         xe_uc_gucrc_disable(&gt->uc);
>         xe_uc_stop_prepare(&gt->uc);
> +       xe_pagefault_reset(gt_to_xe(gt), gt);

Can we just pass the GT in here and then extrapolate xe from there? I
realize you're thinking of dropping the GT piece, but maybe we can
change the parameters around at that time. Just feels weird passing
these both in at this point.

>         xe_gt_pagefault_reset(gt);
>  
>         xe_uc_stop(&gt->uc);
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> b/drivers/gpu/drm/xe/xe_pagefault.c
> index 14304c41eb23..aef389e51612 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -122,6 +122,24 @@ int xe_pagefault_init(struct xe_device *xe)
>         return err;
>  }
>  
> +static void xe_pagefault_queue_reset(struct xe_device *xe, struct
> xe_gt *gt,
> +                                    struct xe_pagefault_queue
> *pf_queue)
> +{
> +       u32 i;
> +
> +       /* Squash all pending faults on the GT */
> +
> +       spin_lock_irq(&pf_queue->lock);
> +       for (i = pf_queue->tail; i != pf_queue->head;
> +            i = (i + xe_pagefault_entry_size()) % pf_queue->size) {

Should we add a check in here that pf_queue->head is some multiple of
xe_pagefault_entry_size and pf_queue->size is aligned to
xe_pagefault_entry_size()?

> +               struct xe_pagefault *pf = pf_queue->data + i;
> +
> +               if (pf->gt == gt)
> +                       pf->gt = NULL;

Not sure I fully get the intent here... so we loop back around from
TAIL to HEAD and clear all of the GTs in pf_queue->data for each one?
Is the expectation that each entry in the pf_queue has the same GT or
is NULL? And then setting to NULL is a way we can abstract out the GT?

Still getting through the series, so I apologize if this is also
answered later in the series...

Thanks,
Stuart

> +       }
> +       spin_unlock_irq(&pf_queue->lock);
> +}
> +
>  /**
>   * xe_pagefault_reset() - Page fault reset for a GT
>   * @xe: xe device instance
> @@ -132,7 +150,10 @@ int xe_pagefault_init(struct xe_device *xe)
>   */
>  void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
>  {
> -       /* TODO - implement */
> +       int i;
> +
> +       for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i)
> +               xe_pagefault_queue_reset(xe, gt, xe->usm.pf_queue +
> i);
>  }
>  
>  /**


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-06 23:01   ` Summers, Stuart
@ 2025-08-06 23:53     ` Matthew Brost
  2025-08-07 17:20       ` Summers, Stuart
  0 siblings, 1 reply; 51+ messages in thread
From: Matthew Brost @ 2025-08-06 23:53 UTC (permalink / raw)
  To: Summers, Stuart
  Cc: intel-xe@lists.freedesktop.org, Mrozek, Michal,
	Ghimiray, Himal Prasad, thomas.hellstrom@linux.intel.com,
	Dugast, Francois

On Wed, Aug 06, 2025 at 05:01:12PM -0600, Summers, Stuart wrote:
> Few basic comments below to start. I personally would rather this be
> brought over from the existing fault handler rather than creating
> something entirely new and then clobbering the older stuff - just so
> the review of message format requests/replies is easier to review and
> where we're deviating from the existing external interfaces
> (HW/FW/GuC/etc). You already have this here though so not a huge deal.
> I think most of this was in the giant blob of patches that got merged
> with the initial driver, so I guess the counter argument is we can have
> easy to reference historical reviews now.
> 

Yes, page fault code is largely just a big blob from the original Xe
patch that wasn't the most well thought out code. We still have that
history in the tree, just git blame won't work, so you'd need to know
where to look if you want that.

I don't think there is a great way to pull this over, unless patches 2-7
are squashed into a single patch + a couple of 'git mv' are used.

> On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > Stub out the new page fault layer and add kernel documentation. This
> > is
> > intended as a replacement for the GT page fault layer, enabling
> > multiple
> > producers to hook into a shared page fault consumer interface.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile             |   1 +
> >  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
> >  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
> >  drivers/gpu/drm/xe/xe_pagefault_types.h | 125
> > ++++++++++++++++++++++++
> >  4 files changed, 208 insertions(+)
> >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
> >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index 8e0c3412a757..6fbebafe79c9 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
> >         xe_nvm.o \
> >         xe_oa.o \
> >         xe_observation.o \
> > +       xe_pagefault.o \
> >         xe_pat.o \
> >         xe_pci.o \
> >         xe_pcode.o \
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > b/drivers/gpu/drm/xe/xe_pagefault.c
> > new file mode 100644
> > index 000000000000..3ce0e8d74b9d
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > @@ -0,0 +1,63 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2025 Intel Corporation
> > + */
> > +
> > +#include "xe_pagefault.h"
> > +#include "xe_pagefault_types.h"
> > +
> > +/**
> > + * DOC: Xe page faults
> > + *
> > + * Xe page faults are handled in two layers. The producer layer
> > interacts with
> > + * hardware or firmware to receive and parse faults into struct
> > xe_pagefault,
> > + * then forwards them to the consumer. The consumer layer services
> > the faults
> > + * (e.g., memory migration, page table updates) and acknowledges the
> > result back
> > + * to the producer, which then forwards the results to the hardware
> > or firmware.
> > + * The consumer uses a page fault queue sized to absorb all
> > potential faults and
> > + * a multi-threaded worker to process them. Multiple producers are
> > supported,
> > + * with a single shared consumer.
> > + */
> > +
> > +/**
> > + * xe_pagefault_init() - Page fault init
> > + * @xe: xe device instance
> > + *
> > + * Initialize Xe page fault state. Must be done after reading fuses.
> > + *
> > + * Return: 0 on Success, errno on failure
> > + */
> > +int xe_pagefault_init(struct xe_device *xe)
> > +{
> > +       /* TODO - implement */
> > +       return 0;
> > +}
> > +
> > +/**
> > + * xe_pagefault_reset() - Page fault reset for a GT
> > + * @xe: xe device instance
> > + * @gt: GT being reset
> > + *
> > + * Reset the Xe page fault state for a GT; that is, squash any
> > pending faults on
> > + * the GT.
> > + */
> > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> > +{
> > +       /* TODO - implement */
> > +}
> > +
> > +/**
> > + * xe_pagefault_handler() - Page fault handler
> > + * @xe: xe device instance
> > + * @pf: Page fault
> > + *
> > + * Sink the page fault to a queue (i.e., a memory buffer) and queue
> > a worker to
> > + * service it. Safe to be called from IRQ or process context.
> > Reclaim safe.
> > + *
> > + * Return: 0 on success, errno on failure
> > + */
> > +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault
> > *pf)
> > +{
> > +       /* TODO - implement */
> > +       return 0;
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault.h
> > b/drivers/gpu/drm/xe/xe_pagefault.h
> > new file mode 100644
> > index 000000000000..bd0cdf9ed37f
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> > @@ -0,0 +1,19 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2025 Intel Corporation
> > + */
> > +
> > +#ifndef _XE_PAGEFAULT_H_
> > +#define _XE_PAGEFAULT_H_
> > +
> > +struct xe_device;
> > +struct xe_gt;
> > +struct xe_pagefault;
> > +
> > +int xe_pagefault_init(struct xe_device *xe);
> > +
> > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
> > +
> > +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault
> > *pf);
> > +
> > +#endif
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h
> > b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > new file mode 100644
> > index 000000000000..fcff84f93dd8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > @@ -0,0 +1,125 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2025 Intel Corporation
> > + */
> > +
> > +#ifndef _XE_PAGEFAULT_TYPES_H_
> > +#define _XE_PAGEFAULT_TYPES_H_
> > +
> > +#include <linux/workqueue.h>
> > +
> > +struct xe_pagefault;
> > +struct xe_gt;
> 
> Nit: Maybe reverse these structs to be in alphabetical order
> 

Yes, that is the preferred style. Will fix.

> > +
> > +/** enum xe_pagefault_access_type - Xe page fault access type */
> > +enum xe_pagefault_access_type {
> > +       /** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type */
> > +       XE_PAGEFAULT_ACCESS_TYPE_READ   = 0,
> > +       /** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access type */
> > +       XE_PAGEFAULT_ACCESS_TYPE_WRITE  = 1,
> > +       /** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access type */
> > +       XE_PAGEFAULT_ACCESS_TYPE_ATOMIC = 2,
> > +};
> > +
> > +/** enum xe_pagefault_type - Xe page fault type */
> > +enum xe_pagefault_type {
> > +       /** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> > +       XE_PAGEFAULT_TYPE_NOT_PRESENT           = 0,
> > +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write access
> > violation */
> > +       XE_PAGEFAULT_WRITE_ACCESS_VIOLATION     = 1,
> > +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic access
> > violation */
> 
> XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION ->
> XE_PAGEFAULT_ACCESS_TYPE_ATOMIC
> 

The intended prefix here is 'XE_PAGEFAULT_TYPE_' to normalize the naming
with 'enum xe_pagefault_type'.

> > +       XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION    = 2,
> > +};
> > +
> > +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> > +struct xe_pagefault_ops {
> > +       /**
> > +        * @ack_fault: Ack fault
> > +        * @pf: Page fault
> > +        * @err: Error state of fault
> > +        *
> > +        * Page fault producer receives acknowledgment from the
> > consumer and
> > +        * sends the result to the HW/FW interface.
> > +        */
> > +       void (*ack_fault)(struct xe_pagefault *pf, int err);
> > +};
> > +
> > +/**
> > + * struct xe_pagefault - Xe page fault
> > + *
> > + * Generic page fault structure for communication between producer
> > and consumer.
> > + * Carefully sized to be 64 bytes.
> > + */
> > +struct xe_pagefault {
> > +       /**
> > +        * @gt: GT of fault
> > +        *
> > +        * XXX: We may want to decouple the GT from individual
> > faults, as it's
> > +        * unclear whether future platforms will always have a GT for
> > all page
> > +        * fault producers. Internally, the GT is used for stats,
> > identifying
> > +        * the appropriate VRAM region, and locating the migration
> > queue.
> > +        * Leaving this as-is for now, but we can revisit later to
> > see if we
> > +        * can convert it to use the Xe device pointer instead.
> > +        */
> 
> What if instead of assuming the GT stays static and we eventually
> remove it if we have some new HW abstraction layer that isn't a GT but
> still uses the page fault, we instead push to have said theoretical
> abstraction layer overload the GT here like we're doing with primary
> and media today. Then we can keep the interface here simple and just
> leave this in there, or change in the future if that doesn't make sense
> without the suggestive comment?
>

I can remove this comment, as it adds some confusion. Hopefully, we
always have a GT. I was just speculating about future cases where we
might not have one. From a purely interface perspective, it would be
ideal to completely decouple the GT here.
 
> > +       struct xe_gt *gt;
> > +       /**
> > +        * @consumer: State for the software handling the fault.
> > Populated by
> > +        * the producer and may be modified by the consumer to
> > communicate
> > +        * information back to the producer upon fault
> > acknowledgment.
> > +        */
> > +       struct {
> > +               /** @consumer.page_addr: address of page fault */
> > +               u64 page_addr;
> > +               /** @consumer.asid: address space ID */
> > +               u32 asid;
> 
> Can we just call this an ID instead of a pasid or asid? I.e. the ID
> could be anything, not strictly process-bound.
> 

I think the idea here is that this serves as the ID for our reverse VM
lookup mechanism in the KMD. We call it ASID throughout the codebase
today, so we’re stuck with the name—though it may or may not have any
actual meaning in hardware, depending on the producer. For example, if
the producer receives a fault based on a queue ID, we’d look up the
queue and then pass in q->vm.asid.

We could even have the producer look up the VM directly, if preferred,
and just pass that over. However, that would require a few more bits
here and might introduce lifetime issues—for example, we’d have to
refcount the VM.
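
For reference, the consumer-side reverse lookup is roughly the
following - a sketch along the lines of the lookup xe_gt_pagefault.c
does today, assuming the xe->usm.asid_to_vm xarray, not necessarily
what the new layer will end up with:

static struct xe_vm *xe_pagefault_asid_to_vm(struct xe_device *xe, u32 asid)
{
	struct xe_vm *vm;

	down_read(&xe->usm.lock);
	vm = xa_load(&xe->usm.asid_to_vm, asid);
	if (vm && xe_vm_in_fault_mode(vm))
		xe_vm_get(vm);	/* caller drops the reference when done */
	else
		vm = ERR_PTR(-EINVAL);
	up_read(&xe->usm.lock);

	return vm;
}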

> > +               /** @consumer.access_type: access type */
> > +               u8 access_type;
> > +               /** @consumer.fault_type: fault type */
> > +               u8 fault_type;
> > +#define XE_PAGEFAULT_LEVEL_NACK                0xff    /* Producer
> > indicates nack fault */
> > +               /** @consumer.fault_level: fault level */
> > +               u8 fault_level;
> > +               /** @consumer.engine_class: engine class */
> > +               u8 engine_class;
> > +               /** consumer.reserved: reserved bits for future
> > expansion */
> > +               u64 reserved;
> 
> What about engine instance? Or is that going to overload reserved here?
> 

reserved could be used to include 'engine instance' if required; it is
there for future expansion and also to keep the structure sized to 64 bytes.
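
If we want to enforce that, something like the below could sit next to
the struct definition (illustrative only, assumes a 64-bit build, not
part of this series):

/* Consumer queue sizing math assumes a 64-byte, power-of-two entry */
static_assert(sizeof(struct xe_pagefault) == 64);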

I included fault_level and engine_class as I thought both were used by
[1], but now that I've looked again only fault_level is used, so I
guess engine_class can be pulled out too unless we want to keep it for
the only place in which it is used (debug messages).

Matt

[1] https://patchwork.freedesktop.org/series/148727/

> Thanks,
> Stuart
> 
> > +       } consumer;
> > +       /**
> > +        * @producer: State for the producer (i.e., HW/FW interface).
> > Populated
> > +        * by the producer and should not be modified—or even
> > inspected—by the
> > +        * consumer, except for calling operations.
> > +        */
> > +       struct {
> > +               /** @producer.private: private pointer */
> > +               void *private;
> > +               /** @producer.ops: operations */
> > +               const struct xe_pagefault_ops *ops;
> > +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW       4
> > +               /**
> > +                * producer.msg: page fault message, used by producer
> > in fault
> > +                * acknowledgement to formulate response to HW/FW
> > interface.
> > +                */
> > +               u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
> > +       } producer;
> > +};
> > +
> > +/** struct xe_pagefault_queue: Xe pagefault queue (consumer) */
> > +struct xe_pagefault_queue {
> > +       /**
> > +        * @data: Data in queue containing struct xe_pagefault,
> > protected by
> > +        * @lock
> > +        */
> > +       void *data;
> > +       /** @size: Size of queue in bytes */
> > +       u32 size;
> > +       /** @head: Head pointer in bytes, moved by producer,
> > protected by @lock */
> > +       u32 head;
> > +       /** @tail: Tail pointer in bytes, moved by consumer,
> > protected by @lock */
> > +       u32 tail;
> > +       /** @lock: protects page fault queue */
> > +       spinlock_t lock;
> > +       /** @worker: to process page faults */
> > +       struct work_struct worker;
> > +};
> > +
> > +#endif
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/11] drm/xe: Implement xe_pagefault_init
  2025-08-06 23:08   ` Summers, Stuart
@ 2025-08-06 23:59     ` Matthew Brost
  2025-08-07 18:22       ` Summers, Stuart
  0 siblings, 1 reply; 51+ messages in thread
From: Matthew Brost @ 2025-08-06 23:59 UTC (permalink / raw)
  To: Summers, Stuart
  Cc: intel-xe@lists.freedesktop.org, Mrozek, Michal,
	Ghimiray, Himal Prasad, thomas.hellstrom@linux.intel.com,
	Dugast, Francois

On Wed, Aug 06, 2025 at 05:08:18PM -0600, Summers, Stuart wrote:
> On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > Create pagefault queues and initialize them.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_device.c       |  5 ++
> >  drivers/gpu/drm/xe/xe_device_types.h |  6 ++
> >  drivers/gpu/drm/xe/xe_pagefault.c    | 93
> > +++++++++++++++++++++++++++-
> >  3 files changed, 102 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > b/drivers/gpu/drm/xe/xe_device.c
> > index 57edbc63da6f..c7c8aee03841 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -50,6 +50,7 @@
> >  #include "xe_nvm.h"
> >  #include "xe_oa.h"
> >  #include "xe_observation.h"
> > +#include "xe_pagefault.h"
> >  #include "xe_pat.h"
> >  #include "xe_pcode.h"
> >  #include "xe_pm.h"
> > @@ -890,6 +891,10 @@ int xe_device_probe(struct xe_device *xe)
> >         if (err)
> >                 return err;
> >  
> > +       err = xe_pagefault_init(xe);
> > +       if (err)
> > +               return err;
> > +
> >         xe_nvm_init(xe);
> >  
> >         err = xe_heci_gsc_init(xe);
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > b/drivers/gpu/drm/xe/xe_device_types.h
> > index 01e8fa0d2f9f..6aa119026ce9 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -17,6 +17,7 @@
> >  #include "xe_lmtt_types.h"
> >  #include "xe_memirq_types.h"
> >  #include "xe_oa_types.h"
> > +#include "xe_pagefault_types.h"
> >  #include "xe_platform_types.h"
> >  #include "xe_pmu_types.h"
> >  #include "xe_pt_types.h"
> > @@ -394,6 +395,11 @@ struct xe_device {
> >                 u32 next_asid;
> >                 /** @usm.lock: protects UM state */
> >                 struct rw_semaphore lock;
> > +               /** @usm.pf_wq: page fault work queue, unbound, high
> > priority */
> > +               struct workqueue_struct *pf_wq;
> > +#define XE_PAGEFAULT_QUEUE_COUNT       4
> > +               /** @pf_queue: Page fault queues */
> > +               struct xe_pagefault_queue
> > pf_queue[XE_PAGEFAULT_QUEUE_COUNT];
> >         } usm;
> >  
> >         /** @pinned: pinned BO state */
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > b/drivers/gpu/drm/xe/xe_pagefault.c
> > index 3ce0e8d74b9d..14304c41eb23 100644
> > --- a/drivers/gpu/drm/xe/xe_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > @@ -3,6 +3,10 @@
> >   * Copyright © 2025 Intel Corporation
> >   */
> >  
> > +#include <drm/drm_managed.h>
> > +
> > +#include "xe_device.h"
> > +#include "xe_gt_types.h"
> >  #include "xe_pagefault.h"
> >  #include "xe_pagefault_types.h"
> >  
> > @@ -19,6 +23,71 @@
> >   * with a single shared consumer.
> >   */
> >  
> > +static int xe_pagefault_entry_size(void)
> > +{
> > +       return roundup_pow_of_two(sizeof(struct xe_pagefault));
> 
> Nice thanks!
> 
> > +}
> > +
> > +static void xe_pagefault_queue_work(struct work_struct *w)
> > +{
> > +       /* TODO: Implement */
> > +}
> > +
> > +static int xe_pagefault_queue_init(struct xe_device *xe,
> > +                                  struct xe_pagefault_queue
> > *pf_queue)
> > +{
> > +       struct xe_gt *gt;
> > +       int total_num_eus = 0;
> > +       u8 id;
> > +
> > +       for_each_gt(gt, xe, id) {
> > +               xe_dss_mask_t all_dss;
> > +               int num_dss, num_eus;
> > +
> > +               bitmap_or(all_dss, gt->fuse_topo.g_dss_mask,
> > +                         gt->fuse_topo.c_dss_mask,
> > XE_MAX_DSS_FUSE_BITS);
> > +
> > +               num_dss = bitmap_weight(all_dss,
> > XE_MAX_DSS_FUSE_BITS);
> > +               num_eus = bitmap_weight(gt-
> > >fuse_topo.eu_mask_per_dss,
> > +                                       XE_MAX_EU_FUSE_BITS) *
> > num_dss;
> > +
> > +               total_num_eus += num_eus;
> 
> I'm behind on that patch I had posted a while back to update this
> algorithm :(. Want to pull that calculation in here directly so we can
> remove the PF_MULTIPLIER you have below?
> 
> See https://patchwork.freedesktop.org/patch/651415/?series=148523&rev=1
> 
> I can also rework that on top of this if you'd prefer, either way is
> fine with me.
> 

Either way works for me. If you get yours in ahead of me, no issue
picking it up in a rebase, or of course if this lands first you can
post on top of it. If you are concerned about history, then the latter
might be better.

Matt

> Thanks,
> Stuart
> 
> > +       }
> > +
> > +       xe_assert(xe, total_num_eus);
> > +
> > +       /*
> > +        * user can issue separate page faults per EU and per CS
> > +        *
> > +        * XXX: Multiplier required as compute UMD are getting PF
> > queue errors
> > +        * without it. Follow on why this multiplier is required.
> > +        */
> > +#define PF_MULTIPLIER  8
> > +       pf_queue->size = (total_num_eus + XE_NUM_HW_ENGINES) *
> > +               xe_pagefault_entry_size() * PF_MULTIPLIER;
> > +       pf_queue->size = roundup_pow_of_two(pf_queue->size);
> > +#undef PF_MULTIPLIER
> > +
> > +       drm_dbg(&xe->drm, "xe_pagefault_entry_size=%d,
> > total_num_eus=%d, pf_queue->size=%u",
> > +               xe_pagefault_entry_size(), total_num_eus, pf_queue-
> > >size);
> > +
> > +       pf_queue->data = devm_kzalloc(xe->drm.dev, pf_queue->size,
> > GFP_KERNEL);
> > +       if (!pf_queue->data)
> > +               return -ENOMEM;
> > +
> > +       spin_lock_init(&pf_queue->lock);
> > +       INIT_WORK(&pf_queue->worker, xe_pagefault_queue_work);
> > +
> > +       return 0;
> > +}
> > +
> > +static void xe_pagefault_fini(void *arg)
> > +{
> > +       struct xe_device *xe = arg;
> > +
> > +       destroy_workqueue(xe->usm.pf_wq);
> > +}
> > +
> >  /**
> >   * xe_pagefault_init() - Page fault init
> >   * @xe: xe device instance
> > @@ -29,8 +98,28 @@
> >   */
> >  int xe_pagefault_init(struct xe_device *xe)
> >  {
> > -       /* TODO - implement */
> > -       return 0;
> > +       int err, i;
> > +
> > +       if (!xe->info.has_usm)
> > +               return 0;
> > +
> > +       xe->usm.pf_wq = alloc_workqueue("xe_page_fault_work_queue",
> > +                                       WQ_UNBOUND | WQ_HIGHPRI,
> > +                                       XE_PAGEFAULT_QUEUE_COUNT);
> > +       if (!xe->usm.pf_wq)
> > +               return -ENOMEM;
> > +
> > +       for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i) {
> > +               err = xe_pagefault_queue_init(xe, xe->usm.pf_queue +
> > i);
> > +               if (err)
> > +                       goto err_out;
> > +       }
> > +
> > +       return devm_add_action_or_reset(xe->drm.dev,
> > xe_pagefault_fini, xe);
> > +
> > +err_out:
> > +       destroy_workqueue(xe->usm.pf_wq);
> > +       return err;
> >  }
> >  
> >  /**
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 03/11] drm/xe: Implement xe_pagefault_reset
  2025-08-06 23:16   ` Summers, Stuart
@ 2025-08-07  0:12     ` Matthew Brost
  2025-08-07 18:29       ` Summers, Stuart
  0 siblings, 1 reply; 51+ messages in thread
From: Matthew Brost @ 2025-08-07  0:12 UTC (permalink / raw)
  To: Summers, Stuart
  Cc: intel-xe@lists.freedesktop.org, Mrozek, Michal,
	Ghimiray, Himal Prasad, thomas.hellstrom@linux.intel.com,
	Dugast, Francois

On Wed, Aug 06, 2025 at 05:16:26PM -0600, Summers, Stuart wrote:
> On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > Squash any pending faults on the GT being reset by setting the GT
> > field
> > in struct xe_pagefault to NULL.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gt.c        |  2 ++
> >  drivers/gpu/drm/xe/xe_pagefault.c | 23 ++++++++++++++++++++++-
> >  2 files changed, 24 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index 390394bbaadc..5aa03f89a062 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -50,6 +50,7 @@
> >  #include "xe_map.h"
> >  #include "xe_migrate.h"
> >  #include "xe_mmio.h"
> > +#include "xe_pagefault.h"
> >  #include "xe_pat.h"
> >  #include "xe_pm.h"
> >  #include "xe_mocs.h"
> > @@ -846,6 +847,7 @@ static int gt_reset(struct xe_gt *gt)
> >  
> >         xe_uc_gucrc_disable(&gt->uc);
> >         xe_uc_stop_prepare(&gt->uc);
> > +       xe_pagefault_reset(gt_to_xe(gt), gt);
> 
> Can we just pass the GT in here and then extrapolate xe from there? I
> realize you're thinking of dropping the GT piece, but maybe we can
> change the parameters around at that time. Just feels weird passing
> these both in at this point.
> 

I think the style is generally that a layer's exported functions take
the layer's main argument first. Of course we don't actually do that
everywhere in Xe. Here the main argument for the layer is 'xe'.

Can drop it if you prefer, I don't really have a strong opinion.

> >         xe_gt_pagefault_reset(gt);
> >  
> >         xe_uc_stop(&gt->uc);
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > b/drivers/gpu/drm/xe/xe_pagefault.c
> > index 14304c41eb23..aef389e51612 100644
> > --- a/drivers/gpu/drm/xe/xe_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > @@ -122,6 +122,24 @@ int xe_pagefault_init(struct xe_device *xe)
> >         return err;
> >  }
> >  
> > +static void xe_pagefault_queue_reset(struct xe_device *xe, struct
> > xe_gt *gt,
> > +                                    struct xe_pagefault_queue
> > *pf_queue)
> > +{
> > +       u32 i;
> > +
> > +       /* Squash all pending faults on the GT */
> > +
> > +       spin_lock_irq(&pf_queue->lock);
> > +       for (i = pf_queue->tail; i != pf_queue->head;
> > +            i = (i + xe_pagefault_entry_size()) % pf_queue->size) {
> 
> Should we add a check in here that pf_queue->head is some multiple of
> xe_pagefault_entry_size and pf_queue->size is aligned to
> xe_pagefault_entry_size()?
> 

We can add some asserts. We'd still get an infinite loop, but at least
in CI we'd quickly catch any bug we introduced somewhere.
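
Something along these lines (illustrative):

	xe_assert(xe, IS_ALIGNED(pf_queue->head, xe_pagefault_entry_size()));
	xe_assert(xe, IS_ALIGNED(pf_queue->tail, xe_pagefault_entry_size()));
	xe_assert(xe, IS_ALIGNED(pf_queue->size, xe_pagefault_entry_size()));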

> > +               struct xe_pagefault *pf = pf_queue->data + i;
> > +
> > +               if (pf->gt == gt)
> > +                       pf->gt = NULL;
> 
> Not sure I fully get the intent here... so we loop back around from
> TAIL to HEAD and clear all of the GTs in pf_queue->data for each one?
> Is the expectation that each entry in the pf_queue has the same GT or
> is NULL? And then setting to NULL is a way we can abstract out the GT?
> 

This patch [1] moves the page fault queues from the GT to the device,
as having a thread pool per GT makes little sense - we want our thread
pool to be per device, with enough threads to hit peak CPU<->GPU bus
bandwidth in what we'd expect to be the common prefetch / page fault
cases (e.g., 2M SVM migrations).

When the fault queues were per GT, resets were easy: just reset all the
queues on the GT. But now, since the queues are shared across the
device and individual GTs can reset, we squash all pending faults on
the GT being reset by setting their GT pointer to NULL. This patch [2]
then ignores any fault with a NULL GT in xe_pagefault_queue_work().
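
As a rough sketch of what the consumer side of that looks like - the
helper names below are illustrative, not the exact code in [2]:

static void xe_pagefault_queue_work(struct work_struct *w)
{
	struct xe_pagefault_queue *pf_queue =
		container_of(w, struct xe_pagefault_queue, worker);
	struct xe_pagefault pf;

	/*
	 * xe_pagefault_queue_pop() is a hypothetical helper that copies
	 * the next entry out under pf_queue->lock and advances the tail.
	 */
	while (xe_pagefault_queue_pop(pf_queue, &pf)) {
		if (!pf.gt)	/* GT was reset after the fault was queued */
			continue;

		xe_pagefault_service(&pf);	/* hypothetical consumer path */
	}
}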

Matt

[1] https://patchwork.freedesktop.org/patch/667318/?series=152565&rev=1
[2] https://patchwork.freedesktop.org/patch/667323/?series=152565&rev=1

> Still getting through the series, so I apologize if this is also
> answered later in the series...
> 
> Thanks,
> Stuart
> 
> > +       }
> > +       spin_unlock_irq(&pf_queue->lock);
> > +}
> > +
> >  /**
> >   * xe_pagefault_reset() - Page fault reset for a GT
> >   * @xe: xe device instance
> > @@ -132,7 +150,10 @@ int xe_pagefault_init(struct xe_device *xe)
> >   */
> >  void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> >  {
> > -       /* TODO - implement */
> > +       int i;
> > +
> > +       for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i)
> > +               xe_pagefault_queue_reset(xe, gt, xe->usm.pf_queue +
> > i);
> >  }
> >  
> >  /**
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-06 23:53     ` Matthew Brost
@ 2025-08-07 17:20       ` Summers, Stuart
  2025-08-07 18:10         ` Matthew Brost
  0 siblings, 1 reply; 51+ messages in thread
From: Summers, Stuart @ 2025-08-07 17:20 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Dugast, Francois,
	Ghimiray, Himal Prasad, Mrozek, Michal,
	thomas.hellstrom@linux.intel.com

On Wed, 2025-08-06 at 16:53 -0700, Matthew Brost wrote:
> On Wed, Aug 06, 2025 at 05:01:12PM -0600, Summers, Stuart wrote:
> > Few basic comments below to start. I personally would rather this
> > be
> > brought over from the existing fault handler rather than creating
> > something entirely new and then clobbering the older stuff - just
> > so
> > the review of message format requests/replies is easier to review
> > and
> > where we're deviating from the existing external interfaces
> > (HW/FW/GuC/etc). You already have this here though so not a huge
> > deal.
> > I think most of this was in the giant blob of patches that got
> > merged
> > with the initial driver, so I guess the counter argument is we can
> > have
> > easy to reference historical reviews now.
> > 
> 
> Yes, page fault code is largely just a big blob from the original Xe
> patch that wasn't the most well thought out code. We still have that
> history in the tree, just git blame won't work, so you'd need to know
> where to look if you want that.
> 
> I don't think there is a great way to pull this over, unless patches
> 2-7
> are squashed into a single patch + a couple of 'git mv' are used.

No, I definitely don't think that's worth it. Let's just review as-is.

> 
> > On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > > Stub out the new page fault layer and add kernel documentation.
> > > This
> > > is
> > > intended as a replacement for the GT page fault layer, enabling
> > > multiple
> > > producers to hook into a shared page fault consumer interface.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile             |   1 +
> > >  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
> > >  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
> > >  drivers/gpu/drm/xe/xe_pagefault_types.h | 125
> > > ++++++++++++++++++++++++
> > >  4 files changed, 208 insertions(+)
> > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
> > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
> > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> > > 
> > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > b/drivers/gpu/drm/xe/Makefile
> > > index 8e0c3412a757..6fbebafe79c9 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
> > >         xe_nvm.o \
> > >         xe_oa.o \
> > >         xe_observation.o \
> > > +       xe_pagefault.o \
> > >         xe_pat.o \
> > >         xe_pci.o \
> > >         xe_pcode.o \
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > > b/drivers/gpu/drm/xe/xe_pagefault.c
> > > new file mode 100644
> > > index 000000000000..3ce0e8d74b9d
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > > @@ -0,0 +1,63 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#include "xe_pagefault.h"
> > > +#include "xe_pagefault_types.h"
> > > +
> > > +/**
> > > + * DOC: Xe page faults
> > > + *
> > > + * Xe page faults are handled in two layers. The producer layer
> > > interacts with
> > > + * hardware or firmware to receive and parse faults into struct
> > > xe_pagefault,
> > > + * then forwards them to the consumer. The consumer layer
> > > services
> > > the faults
> > > + * (e.g., memory migration, page table updates) and acknowledges
> > > the
> > > result back
> > > + * to the producer, which then forwards the results to the
> > > hardware
> > > or firmware.
> > > + * The consumer uses a page fault queue sized to absorb all
> > > potential faults and
> > > + * a multi-threaded worker to process them. Multiple producers
> > > are
> > > supported,
> > > + * with a single shared consumer.
> > > + */
> > > +
> > > +/**
> > > + * xe_pagefault_init() - Page fault init
> > > + * @xe: xe device instance
> > > + *
> > > + * Initialize Xe page fault state. Must be done after reading
> > > fuses.
> > > + *
> > > + * Return: 0 on Success, errno on failure
> > > + */
> > > +int xe_pagefault_init(struct xe_device *xe)
> > > +{
> > > +       /* TODO - implement */
> > > +       return 0;
> > > +}
> > > +
> > > +/**
> > > + * xe_pagefault_reset() - Page fault reset for a GT
> > > + * @xe: xe device instance
> > > + * @gt: GT being reset
> > > + *
> > > + * Reset the Xe page fault state for a GT; that is, squash any
> > > pending faults on
> > > + * the GT.
> > > + */
> > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> > > +{
> > > +       /* TODO - implement */
> > > +}
> > > +
> > > +/**
> > > + * xe_pagefault_handler() - Page fault handler
> > > + * @xe: xe device instance
> > > + * @pf: Page fault
> > > + *
> > > + * Sink the page fault to a queue (i.e., a memory buffer) and
> > > queue
> > > a worker to
> > > + * service it. Safe to be called from IRQ or process context.
> > > Reclaim safe.
> > > + *
> > > + * Return: 0 on success, errno on failure
> > > + */
> > > +int xe_pagefault_handler(struct xe_device *xe, struct
> > > xe_pagefault
> > > *pf)
> > > +{
> > > +       /* TODO - implement */
> > > +       return 0;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.h
> > > b/drivers/gpu/drm/xe/xe_pagefault.h
> > > new file mode 100644
> > > index 000000000000..bd0cdf9ed37f
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> > > @@ -0,0 +1,19 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _XE_PAGEFAULT_H_
> > > +#define _XE_PAGEFAULT_H_
> > > +
> > > +struct xe_device;
> > > +struct xe_gt;
> > > +struct xe_pagefault;
> > > +
> > > +int xe_pagefault_init(struct xe_device *xe);
> > > +
> > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
> > > +
> > > +int xe_pagefault_handler(struct xe_device *xe, struct
> > > xe_pagefault
> > > *pf);
> > > +
> > > +#endif
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > new file mode 100644
> > > index 000000000000..fcff84f93dd8
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > @@ -0,0 +1,125 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _XE_PAGEFAULT_TYPES_H_
> > > +#define _XE_PAGEFAULT_TYPES_H_
> > > +
> > > +#include <linux/workqueue.h>
> > > +
> > > +struct xe_pagefault;
> > > +struct xe_gt;
> > 
> > Nit: Maybe reverse these structs to be in alphabetical order
> > 
> 
> Yes, that is the preferred style. Will fix.
> 
> > > +
> > > +/** enum xe_pagefault_access_type - Xe page fault access type */
> > > +enum xe_pagefault_access_type {
> > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type */
> > > +       XE_PAGEFAULT_ACCESS_TYPE_READ   = 0,
> > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access type */
> > > +       XE_PAGEFAULT_ACCESS_TYPE_WRITE  = 1,
> > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access type
> > > */
> > > +       XE_PAGEFAULT_ACCESS_TYPE_ATOMIC = 2,
> > > +};
> > > +
> > > +/** enum xe_pagefault_type - Xe page fault type */
> > > +enum xe_pagefault_type {
> > > +       /** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> > > +       XE_PAGEFAULT_TYPE_NOT_PRESENT           = 0,
> > > +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write
> > > access
> > > violation */
> > > +       XE_PAGEFAULT_WRITE_ACCESS_VIOLATION     = 1,
> > > +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic
> > > access
> > > violation */
> > 
> > XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION ->
> > XE_PAGEFAULT_ACCESS_TYPE_ATOMIC
> > 
> 
> The intended prefix here is 'XE_PAGEFAULT_TYPE_' to normalize the
> naming
> with 'enum xe_pagefault_type'.

Ah sorry you're right. I also should have been more specific that I
meant this should be ATOMIC access vs WRITE access, so:
XE_PAGEFAULT_TYPE_ATOMIC_ACCESS_VIOLATION
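
To spell it out, I was picturing the enum reading roughly like this
(just illustrating the naming I have in mind, not the final code):

	enum xe_pagefault_type {
		XE_PAGEFAULT_TYPE_NOT_PRESENT			= 0,
		XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION	= 1,
		XE_PAGEFAULT_TYPE_ATOMIC_ACCESS_VIOLATION	= 2,
	};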

> 
> > > +       XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION    = 2,
> > > +};
> > > +
> > > +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> > > +struct xe_pagefault_ops {
> > > +       /**
> > > +        * @ack_fault: Ack fault
> > > +        * @pf: Page fault
> > > +        * @err: Error state of fault
> > > +        *
> > > +        * Page fault producer receives acknowledgment from the
> > > consumer and
> > > +        * sends the result to the HW/FW interface.
> > > +        */
> > > +       void (*ack_fault)(struct xe_pagefault *pf, int err);
> > > +};
> > > +
> > > +/**
> > > + * struct xe_pagefault - Xe page fault
> > > + *
> > > + * Generic page fault structure for communication between
> > > producer
> > > and consumer.
> > > + * Carefully sized to be 64 bytes.
> > > + */
> > > +struct xe_pagefault {
> > > +       /**
> > > +        * @gt: GT of fault
> > > +        *
> > > +        * XXX: We may want to decouple the GT from individual
> > > faults, as it's
> > > +        * unclear whether future platforms will always have a GT
> > > for
> > > all page
> > > +        * fault producers. Internally, the GT is used for stats,
> > > identifying
> > > +        * the appropriate VRAM region, and locating the
> > > migration
> > > queue.
> > > +        * Leaving this as-is for now, but we can revisit later
> > > to
> > > see if we
> > > +        * can convert it to use the Xe device pointer instead.
> > > +        */
> > 
> > What if instead of assuming the GT stays static and we eventually
> > remove it if we have some new HW abstraction layer that isn't a GT but
> > still uses the page fault, we instead push to have said theoretical
> > abstraction layer overload the GT here like we're doing with
> > primary
> > and media today. Then we can keep the interface here simple and
> > just
> > leave this in there, or change in the future if that doesn't make
> > sense
> > without the suggestive comment?
> > 
> 
> I can remove this comment, as it adds some confusion. Hopefully, we
> always have a GT. I was just speculating about future cases where we
> might not have one. From a purely interface perspective, it would be
> ideal to completely decouple the GT here.
>  
> > > +       struct xe_gt *gt;
> > > +       /**
> > > +        * @consumer: State for the software handling the fault.
> > > Populated by
> > > +        * the producer and may be modified by the consumer to
> > > communicate
> > > +        * information back to the producer upon fault
> > > acknowledgment.
> > > +        */
> > > +       struct {
> > > +               /** @consumer.page_addr: address of page fault */
> > > +               u64 page_addr;
> > > +               /** @consumer.asid: address space ID */
> > > +               u32 asid;
> > 
> > Can we just call this an ID instead of a pasid or asid? I.e. the ID
> > could be anything, not strictly process-bound.
> > 
> 
> I think the idea here is that this serves as the ID for our reverse
> VM
> lookup mechanism in the KMD. We call it ASID throughout the codebase
> today, so we’re stuck with the name—though it may or may not have any
> actual meaning in hardware, depending on the producer. For example,
> if
> the producer receives a fault based on a queue ID, we’d look up the
> queue and then pass in q->vm.asid.
> 
> We could even have the producer look up the VM directly, if
> preferred,
> and just pass that over. However, that would require a few more bits
> here and might introduce lifetime issues—for example, we’d have to
> refcount the VM.

Yeah I mean some of those problems we can solve if they come up later.
Just thinking having something more generic here would be nice. But I
agree on the cross-KMD usage. We can keep this and change it more
broadly if that makes sense later.

> 
> > > +               /** @consumer.access_type: access type */
> > > +               u8 access_type;
> > > +               /** @consumer.fault_type: fault type */
> > > +               u8 fault_type;
> > > +#define XE_PAGEFAULT_LEVEL_NACK                0xff    /*
> > > Producer
> > > indicates nack fault */
> > > +               /** @consumer.fault_level: fault level */
> > > +               u8 fault_level;
> > > +               /** @consumer.engine_class: engine class */
> > > +               u8 engine_class;
> > > +               /** consumer.reserved: reserved bits for future
> > > expansion */
> > > +               u64 reserved;
> > 
> > What about engine instance? Or is that going to overload reserved
> > here?
> > 
> 
> reserved could be used to include 'engine instance' if required; it is
> there for future expansion and also to keep the structure sized to 64
> bytes.
> 
> I included fault_level and engine_class as I thought both were used by
> [1], but now that I looked again only fault_level is used, so I guess
> engine_class can be pulled out too, unless we want to keep it for the
> only place in which it is used (debug messages).

I think today hardware/GuC provides both engine class and engine
instance, which is why I mentioned it. We can ignore those fields if we
don't feel they are valuable/relevant, but at least today we are
reading those and printing them out.

Thanks,
Stuart

> 
> Matt
> 
> [1] https://patchwork.freedesktop.org/series/148727/
> 
> > Thanks,
> > Stuart
> > 
> > > +       } consumer;
> > > +       /**
> > > +        * @producer: State for the producer (i.e., HW/FW
> > > interface).
> > > Populated
> > > +        * by the producer and should not be modified—or even
> > > inspected—by the
> > > +        * consumer, except for calling operations.
> > > +        */
> > > +       struct {
> > > +               /** @producer.private: private pointer */
> > > +               void *private;
> > > +               /** @producer.ops: operations */
> > > +               const struct xe_pagefault_ops *ops;
> > > +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW       4
> > > +               /**
> > > +                * producer.msg: page fault message, used by
> > > producer
> > > in fault
> > > +                * acknowledgement to formulate response to HW/FW
> > > interface.
> > > +                */
> > > +               u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
> > > +       } producer;
> > > +};
> > > +
> > > +/** struct xe_pagefault_queue: Xe pagefault queue (consumer) */
> > > +struct xe_pagefault_queue {
> > > +       /**
> > > +        * @data: Data in queue containing struct xe_pagefault,
> > > protected by
> > > +        * @lock
> > > +        */
> > > +       void *data;
> > > +       /** @size: Size of queue in bytes */
> > > +       u32 size;
> > > +       /** @head: Head pointer in bytes, moved by producer,
> > > protected by @lock */
> > > +       u32 head;
> > > +       /** @tail: Tail pointer in bytes, moved by consumer,
> > > protected by @lock */
> > > +       u32 tail;
> > > +       /** @lock: protects page fault queue */
> > > +       spinlock_t lock;
> > > +       /** @worker: to process page faults */
> > > +       struct work_struct worker;
> > > +};
> > > +
> > > +#endif
> > 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-07 17:20       ` Summers, Stuart
@ 2025-08-07 18:10         ` Matthew Brost
  2025-08-28 20:18           ` Summers, Stuart
  0 siblings, 1 reply; 51+ messages in thread
From: Matthew Brost @ 2025-08-07 18:10 UTC (permalink / raw)
  To: Summers, Stuart
  Cc: intel-xe@lists.freedesktop.org, Dugast, Francois,
	Ghimiray, Himal Prasad, Mrozek, Michal,
	thomas.hellstrom@linux.intel.com

On Thu, Aug 07, 2025 at 11:20:06AM -0600, Summers, Stuart wrote:
> On Wed, 2025-08-06 at 16:53 -0700, Matthew Brost wrote:
> > On Wed, Aug 06, 2025 at 05:01:12PM -0600, Summers, Stuart wrote:
> > > Few basic comments below to start. I personally would rather this
> > > be
> > > brought over from the existing fault handler rather than creating
> > > something entirely new and then clobbering the older stuff - just
> > > so
> > > the review of message format requests/replies is easier to review
> > > and
> > > where we're deviating from the existing external interfaces
> > > (HW/FW/GuC/etc). You already have this here though so not a huge
> > > deal.
> > > I think most of this was in the giant blob of patches that got
> > > merged
> > > with the initial driver, so I guess the counter argument is we can
> > > have
> > > easy to reference historical reviews now.
> > > 
> > 
> > Yes, page fault code is largely just a big blob from the original Xe
> > patch that wasn't the most well thought out code. We still have that
> > history in the tree, just git blame won't work, so you'd need to know
> > where to look if you want that.
> > 
> > I don't think there is a great way to pull this over, unless patches
> > 2-7
> > are squashed into a single patch + a couple of 'git mv' are used.
> 
> No definitely don't think that's worth it. Let's just review as-is.
> 
> > 
> > > On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > > > Stub out the new page fault layer and add kernel documentation.
> > > > This
> > > > is
> > > > intended as a replacement for the GT page fault layer, enabling
> > > > multiple
> > > > producers to hook into a shared page fault consumer interface.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile             |   1 +
> > > >  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
> > > >  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
> > > >  drivers/gpu/drm/xe/xe_pagefault_types.h | 125
> > > > ++++++++++++++++++++++++
> > > >  4 files changed, 208 insertions(+)
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > b/drivers/gpu/drm/xe/Makefile
> > > > index 8e0c3412a757..6fbebafe79c9 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
> > > >         xe_nvm.o \
> > > >         xe_oa.o \
> > > >         xe_observation.o \
> > > > +       xe_pagefault.o \
> > > >         xe_pat.o \
> > > >         xe_pci.o \
> > > >         xe_pcode.o \
> > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > > > b/drivers/gpu/drm/xe/xe_pagefault.c
> > > > new file mode 100644
> > > > index 000000000000..3ce0e8d74b9d
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > > > @@ -0,0 +1,63 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2025 Intel Corporation
> > > > + */
> > > > +
> > > > +#include "xe_pagefault.h"
> > > > +#include "xe_pagefault_types.h"
> > > > +
> > > > +/**
> > > > + * DOC: Xe page faults
> > > > + *
> > > > + * Xe page faults are handled in two layers. The producer layer
> > > > interacts with
> > > > + * hardware or firmware to receive and parse faults into struct
> > > > xe_pagefault,
> > > > + * then forwards them to the consumer. The consumer layer
> > > > services
> > > > the faults
> > > > + * (e.g., memory migration, page table updates) and acknowledges
> > > > the
> > > > result back
> > > > + * to the producer, which then forwards the results to the
> > > > hardware
> > > > or firmware.
> > > > + * The consumer uses a page fault queue sized to absorb all
> > > > potential faults and
> > > > + * a multi-threaded worker to process them. Multiple producers
> > > > are
> > > > supported,
> > > > + * with a single shared consumer.
> > > > + */
> > > > +
> > > > +/**
> > > > + * xe_pagefault_init() - Page fault init
> > > > + * @xe: xe device instance
> > > > + *
> > > > + * Initialize Xe page fault state. Must be done after reading
> > > > fuses.
> > > > + *
> > > > + * Return: 0 on Success, errno on failure
> > > > + */
> > > > +int xe_pagefault_init(struct xe_device *xe)
> > > > +{
> > > > +       /* TODO - implement */
> > > > +       return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_pagefault_reset() - Page fault reset for a GT
> > > > + * @xe: xe device instance
> > > > + * @gt: GT being reset
> > > > + *
> > > > + * Reset the Xe page fault state for a GT; that is, squash any
> > > > pending faults on
> > > > + * the GT.
> > > > + */
> > > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> > > > +{
> > > > +       /* TODO - implement */
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_pagefault_handler() - Page fault handler
> > > > + * @xe: xe device instance
> > > > + * @pf: Page fault
> > > > + *
> > > > + * Sink the page fault to a queue (i.e., a memory buffer) and
> > > > queue
> > > > a worker to
> > > > + * service it. Safe to be called from IRQ or process context.
> > > > Reclaim safe.
> > > > + *
> > > > + * Return: 0 on success, errno on failure
> > > > + */
> > > > +int xe_pagefault_handler(struct xe_device *xe, struct
> > > > xe_pagefault
> > > > *pf)
> > > > +{
> > > > +       /* TODO - implement */
> > > > +       return 0;
> > > > +}
> > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.h
> > > > b/drivers/gpu/drm/xe/xe_pagefault.h
> > > > new file mode 100644
> > > > index 000000000000..bd0cdf9ed37f
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> > > > @@ -0,0 +1,19 @@
> > > > +/* SPDX-License-Identifier: MIT */
> > > > +/*
> > > > + * Copyright © 2025 Intel Corporation
> > > > + */
> > > > +
> > > > +#ifndef _XE_PAGEFAULT_H_
> > > > +#define _XE_PAGEFAULT_H_
> > > > +
> > > > +struct xe_device;
> > > > +struct xe_gt;
> > > > +struct xe_pagefault;
> > > > +
> > > > +int xe_pagefault_init(struct xe_device *xe);
> > > > +
> > > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
> > > > +
> > > > +int xe_pagefault_handler(struct xe_device *xe, struct
> > > > xe_pagefault
> > > > *pf);
> > > > +
> > > > +#endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > new file mode 100644
> > > > index 000000000000..fcff84f93dd8
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > @@ -0,0 +1,125 @@
> > > > +/* SPDX-License-Identifier: MIT */
> > > > +/*
> > > > + * Copyright © 2025 Intel Corporation
> > > > + */
> > > > +
> > > > +#ifndef _XE_PAGEFAULT_TYPES_H_
> > > > +#define _XE_PAGEFAULT_TYPES_H_
> > > > +
> > > > +#include <linux/workqueue.h>
> > > > +
> > > > +struct xe_pagefault;
> > > > +struct xe_gt;
> > > 
> > > Nit: Maybe reverse these structs to be in alphabetical order
> > > 
> > 
> > Yes, that is the preferred style. Will fix.
> > 
> > > > +
> > > > +/** enum xe_pagefault_access_type - Xe page fault access type */
> > > > +enum xe_pagefault_access_type {
> > > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type */
> > > > +       XE_PAGEFAULT_ACCESS_TYPE_READ   = 0,
> > > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access type */
> > > > +       XE_PAGEFAULT_ACCESS_TYPE_WRITE  = 1,
> > > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access type
> > > > */
> > > > +       XE_PAGEFAULT_ACCESS_TYPE_ATOMIC = 2,
> > > > +};
> > > > +
> > > > +/** enum xe_pagefault_type - Xe page fault type */
> > > > +enum xe_pagefault_type {
> > > > +       /** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> > > > +       XE_PAGEFAULT_TYPE_NOT_PRESENT           = 0,
> > > > +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write
> > > > access
> > > > violation */
> > > > +       XE_PAGEFAULT_WRITE_ACCESS_VIOLATION     = 1,
> > > > +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic
> > > > access
> > > > violation */
> > > 
> > > XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION ->
> > > XE_PAGEFAULT_ACCESS_TYPE_ATOMIC
> > > 
> > 
> > The intended prefix here is 'XE_PAGEFAULT_TYPE_' to normalize the
> > naming
> > with 'enum xe_pagefault_type'.
> 
> Ah sorry you're right. I also should have been more specific that I
> meant this should be ATOMIC access vs WRITE access, so:
> XE_PAGEFAULT_TYPE_ATOMIC_ACCESS_VIOLATION
> 
> > 
> > > > +       XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION    = 2,
> > > > +};
> > > > +
> > > > +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> > > > +struct xe_pagefault_ops {
> > > > +       /**
> > > > +        * @ack_fault: Ack fault
> > > > +        * @pf: Page fault
> > > > +        * @err: Error state of fault
> > > > +        *
> > > > +        * Page fault producer receives acknowledgment from the
> > > > consumer and
> > > > +        * sends the result to the HW/FW interface.
> > > > +        */
> > > > +       void (*ack_fault)(struct xe_pagefault *pf, int err);
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct xe_pagefault - Xe page fault
> > > > + *
> > > > + * Generic page fault structure for communication between
> > > > producer
> > > > and consumer.
> > > > + * Carefully sized to be 64 bytes.
> > > > + */
> > > > +struct xe_pagefault {
> > > > +       /**
> > > > +        * @gt: GT of fault
> > > > +        *
> > > > +        * XXX: We may want to decouple the GT from individual
> > > > faults, as it's
> > > > +        * unclear whether future platforms will always have a GT
> > > > for
> > > > all page
> > > > +        * fault producers. Internally, the GT is used for stats,
> > > > identifying
> > > > +        * the appropriate VRAM region, and locating the
> > > > migration
> > > > queue.
> > > > +        * Leaving this as-is for now, but we can revisit later
> > > > to
> > > > see if we
> > > > +        * can convert it to use the Xe device pointer instead.
> > > > +        */
> > > 
> > > What if instead of assuming the GT stays static and we eventually
> > > remove it if we have some new HW abstraction layer that isn't a GT but
> > > still uses the page fault, we instead push to have said theoretical
> > > abstraction layer overload the GT here like we're doing with
> > > primary
> > > and media today. Then we can keep the interface here simple and
> > > just
> > > leave this in there, or change in the future if that doesn't make
> > > sense
> > > without the suggestive comment?
> > > 
> > 
> > I can remove this comment, as it adds some confusion. Hopefully, we
> > always have a GT. I was just speculating about future cases where we
> > might not have one. From a purely interface perspective, it would be
> > ideal to completely decouple the GT here.
> >  
> > > > +       struct xe_gt *gt;
> > > > +       /**
> > > > +        * @consumer: State for the software handling the fault.
> > > > Populated by
> > > > +        * the producer and may be modified by the consumer to
> > > > communicate
> > > > +        * information back to the producer upon fault
> > > > acknowledgment.
> > > > +        */
> > > > +       struct {
> > > > +               /** @consumer.page_addr: address of page fault */
> > > > +               u64 page_addr;
> > > > +               /** @consumer.asid: address space ID */
> > > > +               u32 asid;
> > > 
> > > Can we just call this an ID instead of a pasid or asid? I.e. the ID
> > > could be anything, not strictly process-bound.
> > > 
> > 
> > I think the idea here is that this serves as the ID for our reverse
> > VM
> > lookup mechanism in the KMD. We call it ASID throughout the codebase
> > today, so we’re stuck with the name—though it may or may not have any
> > actual meaning in hardware, depending on the producer. For example,
> > if
> > the producer receives a fault based on a queue ID, we’d look up the
> > queue and then pass in q->vm.asid.
> > 
> > We could even have the producer look up the VM directly, if
> > preferred,
> > and just pass that over. However, that would require a few more bits
> > here and might introduce lifetime issues—for example, we’d have to
> > refcount the VM.
> 
> Yeah I mean some of those problems we can solve if they come up later.
> Just thinking having something more generic here would be nice. But I
> agree on the cross-KMD usage. We can keep this and change it more
> broadly if that makes sense later.
> 

I think the point here is we are always going to have a VM, which is
required by the consumer to service the fault; the producer side needs
to parse the fault, figure out a known value in the KMD which
corresponds to a VM, and pass it over. We call this value asid today
(it is also the name of the hardware interface and what we program into
the LRC), but we could rename it everywhere in the KMD if that makes
sense, e.g., kmd_vm_id (vm_id is a user space name / value which means
something different).
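
Roughly what I mean, with a hypothetical producer that only has a queue
ID to go on (sketch only; the lookup helper is made up and the exact
field names may differ, e.g. the asid lives under vm->usm today):

	/* Hypothetical producer: resolve fault to a KMD-known VM id */
	struct xe_exec_queue *q = lookup_queue_somehow(queue_id);

	if (!q)
		return -ENOENT;

	pf->consumer.asid = q->vm->usm.asid;	/* reverse VM lookup key */
	pf->consumer.page_addr = fault_addr;

	return xe_pagefault_handler(xe, pf);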

> > 
> > > > +               /** @consumer.access_type: access type */
> > > > +               u8 access_type;
> > > > +               /** @consumer.fault_type: fault type */
> > > > +               u8 fault_type;
> > > > +#define XE_PAGEFAULT_LEVEL_NACK                0xff    /*
> > > > Producer
> > > > indicates nack fault */
> > > > +               /** @consumer.fault_level: fault level */
> > > > +               u8 fault_level;
> > > > +               /** @consumer.engine_class: engine class */
> > > > +               u8 engine_class;
> > > > +               /** consumer.reserved: reserved bits for future
> > > > expansion */
> > > > +               u64 reserved;
> > > 
> > > What about engine instance? Or is that going to overload reserved
> > > here?
> > > 
> > 
> > reserved could be used to include 'engine instance' if required; it
> > is there for future expansion and also to keep the structure sized to
> > 64 bytes.
> > 
> > I included fault_level and engine_class as I thought both were used
> > by [1], but now that I looked again only fault_level is used, so I
> > guess engine_class can be pulled out too, unless we want to keep it
> > for the only place in which it is used (debug messages).
> 
> I think today hardware/GuC provides both engine class and engine
> instance, which is why I mentioned it. We can ignore those fields if we
> don't feel they are valuable/relevant, but at least today we are
> reading those and printing them out.
> 

Yes, the debug message drops the engine instance because we don't pass
that value over. I think that is ok; engine class is typically all we
care about anyway.
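
For reference, with the fields carried in the new struct, the debug
print would boil down to something like this (sketch, exact wording
TBD):

	drm_dbg(&xe->drm,
		"Fault: asid=%u, addr=0x%016llx, access=%u, type=%u, level=%u, class=%u\n",
		pf->consumer.asid, pf->consumer.page_addr,
		pf->consumer.access_type, pf->consumer.fault_type,
		pf->consumer.fault_level, pf->consumer.engine_class);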

Matt

> Thanks,
> Stuart
> 
> > 
> > Matt
> > 
> > [1] https://patchwork.freedesktop.org/series/148727/
> > 
> > > Thanks,
> > > Stuart
> > > 
> > > > +       } consumer;
> > > > +       /**
> > > > +        * @producer: State for the producer (i.e., HW/FW
> > > > interface).
> > > > Populated
> > > > +        * by the producer and should not be modified—or even
> > > > inspected—by the
> > > > +        * consumer, except for calling operations.
> > > > +        */
> > > > +       struct {
> > > > +               /** @producer.private: private pointer */
> > > > +               void *private;
> > > > +               /** @producer.ops: operations */
> > > > +               const struct xe_pagefault_ops *ops;
> > > > +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW       4
> > > > +               /**
> > > > +                * producer.msg: page fault message, used by
> > > > producer
> > > > in fault
> > > > +                * acknowledgement to formulate response to HW/FW
> > > > interface.
> > > > +                */
> > > > +               u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
> > > > +       } producer;
> > > > +};
> > > > +
> > > > +/** struct xe_pagefault_queue: Xe pagefault queue (consumer) */
> > > > +struct xe_pagefault_queue {
> > > > +       /**
> > > > +        * @data: Data in queue containing struct xe_pagefault,
> > > > protected by
> > > > +        * @lock
> > > > +        */
> > > > +       void *data;
> > > > +       /** @size: Size of queue in bytes */
> > > > +       u32 size;
> > > > +       /** @head: Head pointer in bytes, moved by producer,
> > > > protected by @lock */
> > > > +       u32 head;
> > > > +       /** @tail: Tail pointer in bytes, moved by consumer,
> > > > protected by @lock */
> > > > +       u32 tail;
> > > > +       /** @lock: protects page fault queue */
> > > > +       spinlock_t lock;
> > > > +       /** @worker: to process page faults */
> > > > +       struct work_struct worker;
> > > > +};
> > > > +
> > > > +#endif
> > > 
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/11] drm/xe: Implement xe_pagefault_init
  2025-08-06 23:59     ` Matthew Brost
@ 2025-08-07 18:22       ` Summers, Stuart
  0 siblings, 0 replies; 51+ messages in thread
From: Summers, Stuart @ 2025-08-07 18:22 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Dugast, Francois,
	Ghimiray, Himal Prasad, Mrozek, Michal,
	thomas.hellstrom@linux.intel.com

On Wed, 2025-08-06 at 16:59 -0700, Matthew Brost wrote:
> On Wed, Aug 06, 2025 at 05:08:18PM -0600, Summers, Stuart wrote:
> > On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > > Create pagefault queues and initialize them.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_device.c       |  5 ++
> > >  drivers/gpu/drm/xe/xe_device_types.h |  6 ++
> > >  drivers/gpu/drm/xe/xe_pagefault.c    | 93
> > > +++++++++++++++++++++++++++-
> > >  3 files changed, 102 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > b/drivers/gpu/drm/xe/xe_device.c
> > > index 57edbc63da6f..c7c8aee03841 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > @@ -50,6 +50,7 @@
> > >  #include "xe_nvm.h"
> > >  #include "xe_oa.h"
> > >  #include "xe_observation.h"
> > > +#include "xe_pagefault.h"
> > >  #include "xe_pat.h"
> > >  #include "xe_pcode.h"
> > >  #include "xe_pm.h"
> > > @@ -890,6 +891,10 @@ int xe_device_probe(struct xe_device *xe)
> > >         if (err)
> > >                 return err;
> > >  
> > > +       err = xe_pagefault_init(xe);
> > > +       if (err)
> > > +               return err;
> > > +
> > >         xe_nvm_init(xe);
> > >  
> > >         err = xe_heci_gsc_init(xe);
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > index 01e8fa0d2f9f..6aa119026ce9 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -17,6 +17,7 @@
> > >  #include "xe_lmtt_types.h"
> > >  #include "xe_memirq_types.h"
> > >  #include "xe_oa_types.h"
> > > +#include "xe_pagefault_types.h"
> > >  #include "xe_platform_types.h"
> > >  #include "xe_pmu_types.h"
> > >  #include "xe_pt_types.h"
> > > @@ -394,6 +395,11 @@ struct xe_device {
> > >                 u32 next_asid;
> > >                 /** @usm.lock: protects UM state */
> > >                 struct rw_semaphore lock;
> > > +               /** @usm.pf_wq: page fault work queue, unbound,
> > > high
> > > priority */
> > > +               struct workqueue_struct *pf_wq;
> > > +#define XE_PAGEFAULT_QUEUE_COUNT       4
> > > +               /** @pf_queue: Page fault queues */
> > > +               struct xe_pagefault_queue
> > > pf_queue[XE_PAGEFAULT_QUEUE_COUNT];
> > >         } usm;
> > >  
> > >         /** @pinned: pinned BO state */
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > > b/drivers/gpu/drm/xe/xe_pagefault.c
> > > index 3ce0e8d74b9d..14304c41eb23 100644
> > > --- a/drivers/gpu/drm/xe/xe_pagefault.c
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > > @@ -3,6 +3,10 @@
> > >   * Copyright © 2025 Intel Corporation
> > >   */
> > >  
> > > +#include <drm/drm_managed.h>
> > > +
> > > +#include "xe_device.h"
> > > +#include "xe_gt_types.h"
> > >  #include "xe_pagefault.h"
> > >  #include "xe_pagefault_types.h"
> > >  
> > > @@ -19,6 +23,71 @@
> > >   * with a single shared consumer.
> > >   */
> > >  
> > > +static int xe_pagefault_entry_size(void)
> > > +{
> > > +       return roundup_pow_of_two(sizeof(struct xe_pagefault));
> > 
> > Nice thanks!
> > 
> > > +}
> > > +
> > > +static void xe_pagefault_queue_work(struct work_struct *w)
> > > +{
> > > +       /* TODO: Implement */
> > > +}
> > > +
> > > +static int xe_pagefault_queue_init(struct xe_device *xe,
> > > +                                  struct xe_pagefault_queue
> > > *pf_queue)
> > > +{
> > > +       struct xe_gt *gt;
> > > +       int total_num_eus = 0;
> > > +       u8 id;
> > > +
> > > +       for_each_gt(gt, xe, id) {
> > > +               xe_dss_mask_t all_dss;
> > > +               int num_dss, num_eus;
> > > +
> > > +               bitmap_or(all_dss, gt->fuse_topo.g_dss_mask,
> > > +                         gt->fuse_topo.c_dss_mask,
> > > XE_MAX_DSS_FUSE_BITS);
> > > +
> > > +               num_dss = bitmap_weight(all_dss,
> > > XE_MAX_DSS_FUSE_BITS);
> > > +               num_eus = bitmap_weight(gt-
> > > > fuse_topo.eu_mask_per_dss,
> > > +                                       XE_MAX_EU_FUSE_BITS) *
> > > num_dss;
> > > +
> > > +               total_num_eus += num_eus;
> > 
> > I'm behind on that patch I had posted a while back to update this
> > algorithm :(. Want to pull that calculation in here directly so we
> > can
> > remove the PF_MULTIPLIER you have below?
> > 
> > See
> > https://patchwork.freedesktop.org/patch/651415/?series=148523&rev=1
> > 
> > I can also rework that on top of this if you'd prefer, either way
> > is
> > fine with me.
> > 
> 
> Either way works for me. If you get this one in ahead of me, no issue
> picking it up in a rebase, or of course if this lands first you can
> post on top of it. If you are concerned about history, then the latter
> might be better.

Ok, I'll hold off on pushing mine back to the list until you're done. I
was also suggesting you could just pull my changes into your patch if
you wanted, although maybe that complicates what is mostly an internal
interface update here with more functionally focused changes. But if you
decide not to do that, I'll just wait until yours lands and pull mine in
on top.

Thanks,
Stuart

> 
> Matt
> 
> > Thanks,
> > Stuart
> > 
> > > +       }
> > > +
> > > +       xe_assert(xe, total_num_eus);
> > > +
> > > +       /*
> > > +        * user can issue separate page faults per EU and per CS
> > > +        *
> > > +        * XXX: Multiplier required as compute UMD are getting PF
> > > queue errors
> > > +        * without it. Follow on why this multiplier is required.
> > > +        */
> > > +#define PF_MULTIPLIER  8
> > > +       pf_queue->size = (total_num_eus + XE_NUM_HW_ENGINES) *
> > > +               xe_pagefault_entry_size() * PF_MULTIPLIER;
> > > +       pf_queue->size = roundup_pow_of_two(pf_queue->size);
> > > +#undef PF_MULTIPLIER
> > > +
> > > +       drm_dbg(&xe->drm, "xe_pagefault_entry_size=%d,
> > > total_num_eus=%d, pf_queue->size=%u",
> > > +               xe_pagefault_entry_size(), total_num_eus,
> > > pf_queue-
> > > > size);
> > > +
> > > +       pf_queue->data = devm_kzalloc(xe->drm.dev, pf_queue-
> > > >size,
> > > GFP_KERNEL);
> > > +       if (!pf_queue->data)
> > > +               return -ENOMEM;
> > > +
> > > +       spin_lock_init(&pf_queue->lock);
> > > +       INIT_WORK(&pf_queue->worker, xe_pagefault_queue_work);
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static void xe_pagefault_fini(void *arg)
> > > +{
> > > +       struct xe_device *xe = arg;
> > > +
> > > +       destroy_workqueue(xe->usm.pf_wq);
> > > +}
> > > +
> > >  /**
> > >   * xe_pagefault_init() - Page fault init
> > >   * @xe: xe device instance
> > > @@ -29,8 +98,28 @@
> > >   */
> > >  int xe_pagefault_init(struct xe_device *xe)
> > >  {
> > > -       /* TODO - implement */
> > > -       return 0;
> > > +       int err, i;
> > > +
> > > +       if (!xe->info.has_usm)
> > > +               return 0;
> > > +
> > > +       xe->usm.pf_wq =
> > > alloc_workqueue("xe_page_fault_work_queue",
> > > +                                       WQ_UNBOUND | WQ_HIGHPRI,
> > > +                                       XE_PAGEFAULT_QUEUE_COUNT)
> > > ;
> > > +       if (!xe->usm.pf_wq)
> > > +               return -ENOMEM;
> > > +
> > > +       for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i) {
> > > +               err = xe_pagefault_queue_init(xe, xe-
> > > >usm.pf_queue +
> > > i);
> > > +               if (err)
> > > +                       goto err_out;
> > > +       }
> > > +
> > > +       return devm_add_action_or_reset(xe->drm.dev,
> > > xe_pagefault_fini, xe);
> > > +
> > > +err_out:
> > > +       destroy_workqueue(xe->usm.pf_wq);
> > > +       return err;
> > >  }
> > >  
> > >  /**
> > 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 03/11] drm/xe: Implement xe_pagefault_reset
  2025-08-07  0:12     ` Matthew Brost
@ 2025-08-07 18:29       ` Summers, Stuart
  0 siblings, 0 replies; 51+ messages in thread
From: Summers, Stuart @ 2025-08-07 18:29 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Dugast, Francois,
	Ghimiray, Himal Prasad, Mrozek, Michal,
	thomas.hellstrom@linux.intel.com

On Wed, 2025-08-06 at 17:12 -0700, Matthew Brost wrote:
> On Wed, Aug 06, 2025 at 05:16:26PM -0600, Summers, Stuart wrote:
> > On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > > Squash any pending faults on the GT being reset by setting the GT
> > > field
> > > in struct xe_pagefault to NULL.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_gt.c        |  2 ++
> > >  drivers/gpu/drm/xe/xe_pagefault.c | 23 ++++++++++++++++++++++-
> > >  2 files changed, 24 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.c
> > > b/drivers/gpu/drm/xe/xe_gt.c
> > > index 390394bbaadc..5aa03f89a062 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > @@ -50,6 +50,7 @@
> > >  #include "xe_map.h"
> > >  #include "xe_migrate.h"
> > >  #include "xe_mmio.h"
> > > +#include "xe_pagefault.h"
> > >  #include "xe_pat.h"
> > >  #include "xe_pm.h"
> > >  #include "xe_mocs.h"
> > > @@ -846,6 +847,7 @@ static int gt_reset(struct xe_gt *gt)
> > >  
> > >         xe_uc_gucrc_disable(&gt->uc);
> > >         xe_uc_stop_prepare(&gt->uc);
> > > +       xe_pagefault_reset(gt_to_xe(gt), gt);
> > 
> > Can we just pass the GT in here and then extrapolate xe from there?
> > I
> > realize you're thinking of dropping the GT piece, but maybe we can
> > change the parameters around at that time. Just feels weird passing
> > these both in at this point.
> > 
> 
> I think the style is generally that a layer's exported functions accept
> the layer's main argument first. Of course we don't actually do that
> everywhere in Xe though. Here the main argument for the layer is 'xe'.
> 
> I can drop it if you prefer, I don't really have a strong opinion.

Yeah ok makes sense. Let's leave it the way you have it here.

>  
> 
> > >         xe_gt_pagefault_reset(gt);
> > >  
> > >         xe_uc_stop(&gt->uc);
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > > b/drivers/gpu/drm/xe/xe_pagefault.c
> > > index 14304c41eb23..aef389e51612 100644
> > > --- a/drivers/gpu/drm/xe/xe_pagefault.c
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > > @@ -122,6 +122,24 @@ int xe_pagefault_init(struct xe_device *xe)
> > >         return err;
> > >  }
> > >  
> > > +static void xe_pagefault_queue_reset(struct xe_device *xe,
> > > struct
> > > xe_gt *gt,
> > > +                                    struct xe_pagefault_queue
> > > *pf_queue)
> > > +{
> > > +       u32 i;
> > > +
> > > +       /* Squash all pending faults on the GT */
> > > +
> > > +       spin_lock_irq(&pf_queue->lock);
> > > +       for (i = pf_queue->tail; i != pf_queue->head;
> > > +            i = (i + xe_pagefault_entry_size()) % pf_queue-
> > > >size) {
> > 
> > Should we add a check in here that pf_queue->head is some multiple
> > of
> > xe_pagefault_entry_size and pf_queue->size is aligned to
> > xe_pagefault_entry_size()?
> > 
> 
> We can add some asserts; we would still get an infinite loop, but at
> least in CI we would quickly catch a bug we introduced somewhere.

Yeah, asserts are all I was thinking of. Just to give us a way to detect
regressions in this area easily.
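
Something along these lines (just a sketch):

	xe_assert(xe, IS_ALIGNED(pf_queue->size, xe_pagefault_entry_size()));
	xe_assert(xe, IS_ALIGNED(pf_queue->head, xe_pagefault_entry_size()));
	xe_assert(xe, IS_ALIGNED(pf_queue->tail, xe_pagefault_entry_size()));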

> 
> > > +               struct xe_pagefault *pf = pf_queue->data + i;
> > > +
> > > +               if (pf->gt == gt)
> > > +                       pf->gt = NULL;
> > 
> > Not sure I fully get the intent here... so we loop back around from
> > TAIL to HEAD and clear all of the GTs in pf_queue->data for each
> > one?
> > Is the expectation that each entry in the pf_queue has the same GT
> > or
> > is NULL? And then setting to NULL is a way we can abstract out the
> > GT?
> > 
> 
> This patch [1] moves the page fault queues from the GT to the device,
> as having a thread pool per GT makes little sense - we want our thread
> pool to be per device, with enough threads to hit CPU<->GPU peak bus
> bandwidth in what we'd expect to be common prefetch / page fault cases
> (e.g., 2M SVM migrations).
> 
> When the fault queues were per GT, resets were easy: just reset all the
> queues on the GT. But now, since they are shared on the device and
> individual GTs can reset, we squash all faults on the GT by setting the
> GT to NULL. This patch [2] ignores any fault with a NULL GT in the
> function xe_pagefault_queue_work.

Ok, thanks for the explanation here. I am also still working through
those later patches; this was just an initial comment. Let me read
through the others in detail and I'll get back if I still have questions
there.

Thanks,
Stuart

> 
> Matt
> 
> [1]
> https://patchwork.freedesktop.org/patch/667318/?series=152565&rev=1
> [2]
> https://patchwork.freedesktop.org/patch/667323/?series=152565&rev=1
> 
> > Still getting through the series, so appologize if this is also
> > answered later in the series...
> > 
> > Thanks,
> > Stuart
> > 
> > > +       }
> > > +       spin_unlock_irq(&pf_queue->lock);
> > > +}
> > > +
> > >  /**
> > >   * xe_pagefault_reset() - Page fault reset for a GT
> > >   * @xe: xe device instance
> > > @@ -132,7 +150,10 @@ int xe_pagefault_init(struct xe_device *xe)
> > >   */
> > >  void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> > >  {
> > > -       /* TODO - implement */
> > > +       int i;
> > > +
> > > +       for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i)
> > > +               xe_pagefault_queue_reset(xe, gt, xe->usm.pf_queue
> > > +
> > > i);
> > >  }
> > >  
> > >  /**
> > 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-06  6:22 ` [PATCH 01/11] drm/xe: Stub out new pagefault layer Matthew Brost
  2025-08-06 23:01   ` Summers, Stuart
@ 2025-08-27 15:29   ` Francois Dugast
  2025-08-27 16:03     ` Matthew Brost
  2025-08-28 20:08   ` Summers, Stuart
  2 siblings, 1 reply; 51+ messages in thread
From: Francois Dugast @ 2025-08-27 15:29 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Tue, Aug 05, 2025 at 11:22:32PM -0700, Matthew Brost wrote:
> Stub out the new page fault layer and add kernel documentation. This is
> intended as a replacement for the GT page fault layer, enabling multiple
> producers to hook into a shared page fault consumer interface.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile             |   1 +
>  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
>  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
>  drivers/gpu/drm/xe/xe_pagefault_types.h | 125 ++++++++++++++++++++++++
>  4 files changed, 208 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
>  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
>  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 8e0c3412a757..6fbebafe79c9 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
>  	xe_nvm.o \
>  	xe_oa.o \
>  	xe_observation.o \
> +	xe_pagefault.o \
>  	xe_pat.o \
>  	xe_pci.o \
>  	xe_pcode.o \
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> new file mode 100644
> index 000000000000..3ce0e8d74b9d
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -0,0 +1,63 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include "xe_pagefault.h"
> +#include "xe_pagefault_types.h"
> +
> +/**
> + * DOC: Xe page faults
> + *
> + * Xe page faults are handled in two layers. The producer layer interacts with
> + * hardware or firmware to receive and parse faults into struct xe_pagefault,
> + * then forwards them to the consumer. The consumer layer services the faults
> + * (e.g., memory migration, page table updates) and acknowledges the result back
> + * to the producer, which then forwards the results to the hardware or firmware.
> + * The consumer uses a page fault queue sized to absorb all potential faults and
> + * a multi-threaded worker to process them. Multiple producers are supported,
> + * with a single shared consumer.

I am not through with the series yet but xe_pagefault seems to be the
consumer code only, while the producer code will be located elsewhere
such as in xe_guc*. If so, might be good to write it here or in the
functions below.

> + */
> +
> +/**
> + * xe_pagefault_init() - Page fault init
> + * @xe: xe device instance
> + *
> + * Initialize Xe page fault state. Must be done after reading fuses.
> + *
> + * Return: 0 on Success, errno on failure
> + */
> +int xe_pagefault_init(struct xe_device *xe)
> +{
> +	/* TODO - implement */
> +	return 0;
> +}
> +
> +/**
> + * xe_pagefault_reset() - Page fault reset for a GT
> + * @xe: xe device instance
> + * @gt: GT being reset
> + *
> + * Reset the Xe page fault state for a GT; that is, squash any pending faults on
> + * the GT.
> + */
> +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> +{
> +	/* TODO - implement */
> +}
> +
> +/**
> + * xe_pagefault_handler() - Page fault handler
> + * @xe: xe device instance
> + * @pf: Page fault
> + *
> + * Sink the page fault to a queue (i.e., a memory buffer) and queue a worker to
> + * service it. Safe to be called from IRQ or process context. Reclaim safe.
> + *
> + * Return: 0 on success, errno on failure
> + */
> +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
> +{
> +	/* TODO - implement */
> +	return 0;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.h b/drivers/gpu/drm/xe/xe_pagefault.h
> new file mode 100644
> index 000000000000..bd0cdf9ed37f
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef _XE_PAGEFAULT_H_
> +#define _XE_PAGEFAULT_H_
> +
> +struct xe_device;
> +struct xe_gt;
> +struct xe_pagefault;
> +
> +int xe_pagefault_init(struct xe_device *xe);
> +
> +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
> +
> +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h b/drivers/gpu/drm/xe/xe_pagefault_types.h
> new file mode 100644
> index 000000000000..fcff84f93dd8
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> @@ -0,0 +1,125 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef _XE_PAGEFAULT_TYPES_H_
> +#define _XE_PAGEFAULT_TYPES_H_
> +
> +#include <linux/workqueue.h>
> +
> +struct xe_pagefault;
> +struct xe_gt;
> +
> +/** enum xe_pagefault_access_type - Xe page fault access type */
> +enum xe_pagefault_access_type {
> +	/** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type */
> +	XE_PAGEFAULT_ACCESS_TYPE_READ	= 0,
> +	/** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access type */
> +	XE_PAGEFAULT_ACCESS_TYPE_WRITE	= 1,
> +	/** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access type */
> +	XE_PAGEFAULT_ACCESS_TYPE_ATOMIC	= 2,
> +};
> +
> +/** enum xe_pagefault_type - Xe page fault type */
> +enum xe_pagefault_type {
> +	/** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> +	XE_PAGEFAULT_TYPE_NOT_PRESENT		= 0,
> +	/** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write access violation */
> +	XE_PAGEFAULT_WRITE_ACCESS_VIOLATION	= 1,
> +	/** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic access violation */
> +	XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION	= 2,
> +};
> +
> +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> +struct xe_pagefault_ops {
> +	/**
> +	 * @ack_fault: Ack fault
> +	 * @pf: Page fault
> +	 * @err: Error state of fault
> +	 *
> +	 * Page fault producer receives acknowledgment from the consumer and
> +	 * sends the result to the HW/FW interface.
> +	 */
> +	void (*ack_fault)(struct xe_pagefault *pf, int err);
> +};
> +
> +/**
> + * struct xe_pagefault - Xe page fault
> + *
> + * Generic page fault structure for communication between producer and consumer.
> + * Carefully sized to be 64 bytes.
> + */
> +struct xe_pagefault {
> +	/**
> +	 * @gt: GT of fault
> +	 *
> +	 * XXX: We may want to decouple the GT from individual faults, as it's
> +	 * unclear whether future platforms will always have a GT for all page
> +	 * fault producers. Internally, the GT is used for stats, identifying
> +	 * the appropriate VRAM region, and locating the migration queue.
> +	 * Leaving this as-is for now, but we can revisit later to see if we
> +	 * can convert it to use the Xe device pointer instead.
> +	 */
> +	struct xe_gt *gt;
> +	/**
> +	 * @consumer: State for the software handling the fault. Populated by
> +	 * the producer and may be modified by the consumer to communicate
> +	 * information back to the producer upon fault acknowledgment.
> +	 */
> +	struct {
> +		/** @consumer.page_addr: address of page fault */
> +		u64 page_addr;
> +		/** @consumer.asid: address space ID */
> +		u32 asid;
> +		/** @consumer.access_type: access type */
> +		u8 access_type;

For type safety we should use enum xe_pagefault_access_type instead of u8.

> +		/** @consumer.fault_type: fault type */
> +		u8 fault_type;

Same here with enum xe_pagefault_type instead of u8.

> +#define XE_PAGEFAULT_LEVEL_NACK		0xff	/* Producer indicates nack fault */
> +		/** @consumer.fault_level: fault level */
> +		u8 fault_level;
> +		/** @consumer.engine_class: engine class */
> +		u8 engine_class;
> +		/** consumer.reserved: reserved bits for future expansion */
> +		u64 reserved;
> +	} consumer;
> +	/**
> +	 * @producer: State for the producer (i.e., HW/FW interface). Populated
> +	 * by the producer and should not be modified—or even inspected—by the
> +	 * consumer, except for calling operations.
> +	 */
> +	struct {
> +		/** @producer.private: private pointer */
> +		void *private;
> +		/** @producer.ops: operations */
> +		const struct xe_pagefault_ops *ops;
> +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW	4
> +		/**
> +		 * producer.msg: page fault message, used by producer in fault

s/producer.msg/@producer.msg/

> +		 * acknowledgement to formulate response to HW/FW interface.

s/acknowledgement/acknowledgment/

> +		 */
> +		u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];

It is clear from patch #6 why it is more convenient to have this in
struct xe_pagefault rather than local to the producer, but this seems
to go a bit against the elegant abstraction provided by the new page
fault layer. producer.private could store a struct with {guc, msg},
but that is probably overengineering, so up to you.
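
i.e. something like the below, kept entirely on the producer side
(sketch of the alternative, names made up):

	/* Hypothetical GuC-private state hung off producer.private */
	struct xe_guc_pagefault_reply {
		struct xe_guc *guc;
		u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
	};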

Francois

> +	} producer;
> +};
> +
> +/** struct xe_pagefault_queue: Xe pagefault queue (consumer) */
> +struct xe_pagefault_queue {
> +	/**
> +	 * @data: Data in queue containing struct xe_pagefault, protected by
> +	 * @lock
> +	 */
> +	void *data;
> +	/** @size: Size of queue in bytes */
> +	u32 size;
> +	/** @head: Head pointer in bytes, moved by producer, protected by @lock */
> +	u32 head;
> +	/** @tail: Tail pointer in bytes, moved by consumer, protected by @lock */
> +	u32 tail;
> +	/** @lock: protects page fault queue */
> +	spinlock_t lock;
> +	/** @worker: to process page faults */
> +	struct work_struct worker;
> +};
> +
> +#endif
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-27 15:29   ` Francois Dugast
@ 2025-08-27 16:03     ` Matthew Brost
  2025-08-27 16:25       ` Francois Dugast
  2025-08-27 18:00       ` Matthew Brost
  0 siblings, 2 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-27 16:03 UTC (permalink / raw)
  To: Francois Dugast
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Wed, Aug 27, 2025 at 05:29:46PM +0200, Francois Dugast wrote:
> On Tue, Aug 05, 2025 at 11:22:32PM -0700, Matthew Brost wrote:
> > Stub out the new page fault layer and add kernel documentation. This is
> > intended as a replacement for the GT page fault layer, enabling multiple
> > producers to hook into a shared page fault consumer interface.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile             |   1 +
> >  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
> >  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
> >  drivers/gpu/drm/xe/xe_pagefault_types.h | 125 ++++++++++++++++++++++++
> >  4 files changed, 208 insertions(+)
> >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
> >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index 8e0c3412a757..6fbebafe79c9 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
> >  	xe_nvm.o \
> >  	xe_oa.o \
> >  	xe_observation.o \
> > +	xe_pagefault.o \
> >  	xe_pat.o \
> >  	xe_pci.o \
> >  	xe_pcode.o \
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> > new file mode 100644
> > index 000000000000..3ce0e8d74b9d
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > @@ -0,0 +1,63 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2025 Intel Corporation
> > + */
> > +
> > +#include "xe_pagefault.h"
> > +#include "xe_pagefault_types.h"
> > +
> > +/**
> > + * DOC: Xe page faults
> > + *
> > + * Xe page faults are handled in two layers. The producer layer interacts with
> > + * hardware or firmware to receive and parse faults into struct xe_pagefault,
> > + * then forwards them to the consumer. The consumer layer services the faults
> > + * (e.g., memory migration, page table updates) and acknowledges the result back
> > + * to the producer, which then forwards the results to the hardware or firmware.
> > + * The consumer uses a page fault queue sized to absorb all potential faults and
> > + * a multi-threaded worker to process them. Multiple producers are supported,
> > + * with a single shared consumer.
> 
> I am not through with the series yet but xe_pagefault seems to be the
> consumer code only, while the producer code will be located elsewhere
> such as in xe_guc*. If so, might be good to write it here or in the
> functions below.
> 

I didn't want to mention the GuC specifically, as this layer is intended
to be generic. The GuC is firmware, which is already covered by the
potential producers called out here.
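
Any producer - GuC or otherwise - is expected to parse its HW/FW fault
message into a struct xe_pagefault and hand it to the consumer, roughly
like the below (sketch only, not the actual GuC code in the later patch;
the ops/function names here are made up):

	static void example_ack_fault(struct xe_pagefault *pf, int err)
	{
		/* Build the HW/FW reply from pf->producer.msg and err here */
	}

	static const struct xe_pagefault_ops example_pf_ops = {
		.ack_fault = example_ack_fault,
	};

	static int example_report_fault(struct xe_device *xe, struct xe_gt *gt,
					u64 fault_addr, u32 asid)
	{
		struct xe_pagefault pf = {
			.gt = gt,
			.consumer.page_addr = fault_addr,
			.consumer.asid = asid,
			.consumer.access_type = XE_PAGEFAULT_ACCESS_TYPE_WRITE,
			.consumer.fault_type = XE_PAGEFAULT_TYPE_NOT_PRESENT,
			.producer.ops = &example_pf_ops,
		};

		return xe_pagefault_handler(xe, &pf);
	}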

> > + */
> > +
> > +/**
> > + * xe_pagefault_init() - Page fault init
> > + * @xe: xe device instance
> > + *
> > + * Initialize Xe page fault state. Must be done after reading fuses.
> > + *
> > + * Return: 0 on Success, errno on failure
> > + */
> > +int xe_pagefault_init(struct xe_device *xe)
> > +{
> > +	/* TODO - implement */
> > +	return 0;
> > +}
> > +
> > +/**
> > + * xe_pagefault_reset() - Page fault reset for a GT
> > + * @xe: xe device instance
> > + * @gt: GT being reset
> > + *
> > + * Reset the Xe page fault state for a GT; that is, squash any pending faults on
> > + * the GT.
> > + */
> > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> > +{
> > +	/* TODO - implement */
> > +}
> > +
> > +/**
> > + * xe_pagefault_handler() - Page fault handler
> > + * @xe: xe device instance
> > + * @pf: Page fault
> > + *
> > + * Sink the page fault to a queue (i.e., a memory buffer) and queue a worker to
> > + * service it. Safe to be called from IRQ or process context. Reclaim safe.
> > + *
> > + * Return: 0 on success, errno on failure
> > + */
> > +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
> > +{
> > +	/* TODO - implement */
> > +	return 0;
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault.h b/drivers/gpu/drm/xe/xe_pagefault.h
> > new file mode 100644
> > index 000000000000..bd0cdf9ed37f
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> > @@ -0,0 +1,19 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2025 Intel Corporation
> > + */
> > +
> > +#ifndef _XE_PAGEFAULT_H_
> > +#define _XE_PAGEFAULT_H_
> > +
> > +struct xe_device;
> > +struct xe_gt;
> > +struct xe_pagefault;
> > +
> > +int xe_pagefault_init(struct xe_device *xe);
> > +
> > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
> > +
> > +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf);
> > +
> > +#endif
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > new file mode 100644
> > index 000000000000..fcff84f93dd8
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > @@ -0,0 +1,125 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2025 Intel Corporation
> > + */
> > +
> > +#ifndef _XE_PAGEFAULT_TYPES_H_
> > +#define _XE_PAGEFAULT_TYPES_H_
> > +
> > +#include <linux/workqueue.h>
> > +
> > +struct xe_pagefault;
> > +struct xe_gt;
> > +
> > +/** enum xe_pagefault_access_type - Xe page fault access type */
> > +enum xe_pagefault_access_type {
> > +	/** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type */
> > +	XE_PAGEFAULT_ACCESS_TYPE_READ	= 0,
> > +	/** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access type */
> > +	XE_PAGEFAULT_ACCESS_TYPE_WRITE	= 1,
> > +	/** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access type */
> > +	XE_PAGEFAULT_ACCESS_TYPE_ATOMIC	= 2,
> > +};
> > +
> > +/** enum xe_pagefault_type - Xe page fault type */
> > +enum xe_pagefault_type {
> > +	/** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> > +	XE_PAGEFAULT_TYPE_NOT_PRESENT		= 0,
> > +	/** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write access violation */
> > +	XE_PAGEFAULT_WRITE_ACCESS_VIOLATION	= 1,
> > +	/** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic access violation */
> > +	XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION	= 2,
> > +};
> > +
> > +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> > +struct xe_pagefault_ops {
> > +	/**
> > +	 * @ack_fault: Ack fault
> > +	 * @pf: Page fault
> > +	 * @err: Error state of fault
> > +	 *
> > +	 * Page fault producer receives acknowledgment from the consumer and
> > +	 * sends the result to the HW/FW interface.
> > +	 */
> > +	void (*ack_fault)(struct xe_pagefault *pf, int err);
> > +};
> > +
> > +/**
> > + * struct xe_pagefault - Xe page fault
> > + *
> > + * Generic page fault structure for communication between producer and consumer.
> > + * Carefully sized to be 64 bytes.
> > + */
> > +struct xe_pagefault {
> > +	/**
> > +	 * @gt: GT of fault
> > +	 *
> > +	 * XXX: We may want to decouple the GT from individual faults, as it's
> > +	 * unclear whether future platforms will always have a GT for all page
> > +	 * fault producers. Internally, the GT is used for stats, identifying
> > +	 * the appropriate VRAM region, and locating the migration queue.
> > +	 * Leaving this as-is for now, but we can revisit later to see if we
> > +	 * can convert it to use the Xe device pointer instead.
> > +	 */
> > +	struct xe_gt *gt;
> > +	/**
> > +	 * @consumer: State for the software handling the fault. Populated by
> > +	 * the producer and may be modified by the consumer to communicate
> > +	 * information back to the producer upon fault acknowledgment.
> > +	 */
> > +	struct {
> > +		/** @consumer.page_addr: address of page fault */
> > +		u64 page_addr;
> > +		/** @consumer.asid: address space ID */
> > +		u32 asid;
> > +		/** @consumer.access_type: access type */
> > +		u8 access_type;
> 
> For type safety we shoud use enum xe_pagefault_access_type instead of u8.
> 
> > +		/** @consumer.fault_type: fault type */
> > +		u8 fault_type;
> 
> Same here with enum xe_pagefault_type instead of u8.
> 

This structure is carefully sized to 64 bytes. I'm unsure whether using enums
here would throw off that sizing. I'll check, but I suspect it will; if so,
I think u8 is the correct choice.
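
As a rough sketch (not part of this series), the 64-byte assumption could be
pinned down at build time next to the struct definition, e.g.:

	/* Hypothetical compile-time check; 64 bytes == one cacheline */
	static_assert(sizeof(struct xe_pagefault) == 64,
		      "struct xe_pagefault must stay 64 bytes");

That would catch any member change silently growing the queue entry.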

> > +#define XE_PAGEFAULT_LEVEL_NACK		0xff	/* Producer indicates nack fault */
> > +		/** @consumer.fault_level: fault level */
> > +		u8 fault_level;
> > +		/** @consumer.engine_class: engine class */
> > +		u8 engine_class;
> > +		/** consumer.reserved: reserved bits for future expansion */
> > +		u64 reserved;
> > +	} consumer;
> > +	/**
> > +	 * @producer: State for the producer (i.e., HW/FW interface). Populated
> > +	 * by the producer and should not be modified—or even inspected—by the
> > +	 * consumer, except for calling operations.
> > +	 */
> > +	struct {
> > +		/** @producer.private: private pointer */
> > +		void *private;
> > +		/** @producer.ops: operations */
> > +		const struct xe_pagefault_ops *ops;
> > +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW	4
> > +		/**
> > +		 * producer.msg: page fault message, used by producer in fault
> 
> s/producer.msg/@producer.msg/
> 

+1

> > +		 * acknowledgement to formulate response to HW/FW interface.
> 
> s/acknowledgement/acknowledgment/
> 

+1

> > +		 */
> > +		u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
> 
> It is clear from patch #6 why it is more convenient to have this in
> struct xe_pagefault rather than local to the producer but this seems
> to go a bit against the elegant abstraction provided by this new page
> fault layers. producer.private could store a struct with {guc,msg}
> but that is probably overengineering so up to you.
> 

You can't malloc (with GFP_KERNEL) in the producer path, at least for the
GuC, as we are either in softIRQ context or under the CT lock, which is in
the path of reclaim. The only option is to have these fields populated in
the on-stack xe_pagefault structure in the producer, which is then copied
into a preallocated page fault queue (sized to sink all possible faults on
the device) on the consumer side. I can add kernel doc explaining this.
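
A minimal sketch of that flow (illustrative names, not the actual GuC
producer code):

	/* Producer: no allocation allowed here, so build the fault on the
	 * stack and hand it off; xe_pagefault_handler() copies it into the
	 * preallocated per-device queue.
	 */
	static int example_report_fault(struct xe_device *xe, struct xe_gt *gt,
					u64 addr, u32 asid)
	{
		struct xe_pagefault pf = {
			.gt = gt,
			.consumer = {
				.page_addr = addr,
				.asid = asid,
				.access_type = XE_PAGEFAULT_ACCESS_TYPE_WRITE,
				.fault_type = XE_PAGEFAULT_TYPE_NOT_PRESENT,
			},
			/* .producer.{private,ops,msg} come from the HW/FW message */
		};

		return xe_pagefault_handler(xe, &pf);
	}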

Matt

> Francois
> 
> > +	} producer;
> > +};
> > +
> > +/** struct xe_pagefault_queue: Xe pagefault queue (consumer) */
> > +struct xe_pagefault_queue {
> > +	/**
> > +	 * @data: Data in queue containing struct xe_pagefault, protected by
> > +	 * @lock
> > +	 */
> > +	void *data;
> > +	/** @size: Size of queue in bytes */
> > +	u32 size;
> > +	/** @head: Head pointer in bytes, moved by producer, protected by @lock */
> > +	u32 head;
> > +	/** @tail: Tail pointer in bytes, moved by consumer, protected by @lock */
> > +	u32 tail;
> > +	/** @lock: protects page fault queue */
> > +	spinlock_t lock;
> > +	/** @worker: to process page faults */
> > +	struct work_struct worker;
> > +};
> > +
> > +#endif
> > -- 
> > 2.34.1
> > 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-27 16:03     ` Matthew Brost
@ 2025-08-27 16:25       ` Francois Dugast
  2025-08-27 16:40         ` Matthew Brost
  2025-08-27 18:00       ` Matthew Brost
  1 sibling, 1 reply; 51+ messages in thread
From: Francois Dugast @ 2025-08-27 16:25 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Wed, Aug 27, 2025 at 09:03:57AM -0700, Matthew Brost wrote:
> On Wed, Aug 27, 2025 at 05:29:46PM +0200, Francois Dugast wrote:
> > On Tue, Aug 05, 2025 at 11:22:32PM -0700, Matthew Brost wrote:
> > > Stub out the new page fault layer and add kernel documentation. This is
> > > intended as a replacement for the GT page fault layer, enabling multiple
> > > producers to hook into a shared page fault consumer interface.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile             |   1 +
> > >  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
> > >  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
> > >  drivers/gpu/drm/xe/xe_pagefault_types.h | 125 ++++++++++++++++++++++++
> > >  4 files changed, 208 insertions(+)
> > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
> > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
> > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> > > 
> > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > index 8e0c3412a757..6fbebafe79c9 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
> > >  	xe_nvm.o \
> > >  	xe_oa.o \
> > >  	xe_observation.o \
> > > +	xe_pagefault.o \
> > >  	xe_pat.o \
> > >  	xe_pci.o \
> > >  	xe_pcode.o \
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> > > new file mode 100644
> > > index 000000000000..3ce0e8d74b9d
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > > @@ -0,0 +1,63 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#include "xe_pagefault.h"
> > > +#include "xe_pagefault_types.h"
> > > +
> > > +/**
> > > + * DOC: Xe page faults
> > > + *
> > > + * Xe page faults are handled in two layers. The producer layer interacts with
> > > + * hardware or firmware to receive and parse faults into struct xe_pagefault,
> > > + * then forwards them to the consumer. The consumer layer services the faults
> > > + * (e.g., memory migration, page table updates) and acknowledges the result back
> > > + * to the producer, which then forwards the results to the hardware or firmware.
> > > + * The consumer uses a page fault queue sized to absorb all potential faults and
> > > + * a multi-threaded worker to process them. Multiple producers are supported,
> > > + * with a single shared consumer.
> > 
> > I am not through with the series yet but xe_pagefault seems to be the
> > consumer code only, while the producer code will be located elsewhere
> > such as in xe_guc*. If so, might be good to write it here or in the
> > functions below.
> > 
> 
> I didn't want to mention the GuC specifically as this layer is intended to be
> generic. The GuC is firmware, and firmware is already called out as a
> potential producer.

Sure, sorry for the confusion. I did not mean to name the producer here, but
rather to suggest adding something like "This file contains the consumer
code." to link the doc with the functions below.

> 
> > > + */
> > > +
> > > +/**
> > > + * xe_pagefault_init() - Page fault init
> > > + * @xe: xe device instance
> > > + *
> > > + * Initialize Xe page fault state. Must be done after reading fuses.
> > > + *
> > > + * Return: 0 on Success, errno on failure
> > > + */
> > > +int xe_pagefault_init(struct xe_device *xe)
> > > +{
> > > +	/* TODO - implement */
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * xe_pagefault_reset() - Page fault reset for a GT
> > > + * @xe: xe device instance
> > > + * @gt: GT being reset
> > > + *
> > > + * Reset the Xe page fault state for a GT; that is, squash any pending faults on
> > > + * the GT.
> > > + */
> > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> > > +{
> > > +	/* TODO - implement */
> > > +}
> > > +
> > > +/**
> > > + * xe_pagefault_handler() - Page fault handler
> > > + * @xe: xe device instance
> > > + * @pf: Page fault
> > > + *
> > > + * Sink the page fault to a queue (i.e., a memory buffer) and queue a worker to
> > > + * service it. Safe to be called from IRQ or process context. Reclaim safe.
> > > + *
> > > + * Return: 0 on success, errno on failure
> > > + */
> > > +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
> > > +{
> > > +	/* TODO - implement */
> > > +	return 0;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.h b/drivers/gpu/drm/xe/xe_pagefault.h
> > > new file mode 100644
> > > index 000000000000..bd0cdf9ed37f
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> > > @@ -0,0 +1,19 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _XE_PAGEFAULT_H_
> > > +#define _XE_PAGEFAULT_H_
> > > +
> > > +struct xe_device;
> > > +struct xe_gt;
> > > +struct xe_pagefault;
> > > +
> > > +int xe_pagefault_init(struct xe_device *xe);
> > > +
> > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
> > > +
> > > +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf);
> > > +
> > > +#endif
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > new file mode 100644
> > > index 000000000000..fcff84f93dd8
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > @@ -0,0 +1,125 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _XE_PAGEFAULT_TYPES_H_
> > > +#define _XE_PAGEFAULT_TYPES_H_
> > > +
> > > +#include <linux/workqueue.h>
> > > +
> > > +struct xe_pagefault;
> > > +struct xe_gt;
> > > +
> > > +/** enum xe_pagefault_access_type - Xe page fault access type */
> > > +enum xe_pagefault_access_type {
> > > +	/** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type */
> > > +	XE_PAGEFAULT_ACCESS_TYPE_READ	= 0,
> > > +	/** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access type */
> > > +	XE_PAGEFAULT_ACCESS_TYPE_WRITE	= 1,
> > > +	/** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access type */
> > > +	XE_PAGEFAULT_ACCESS_TYPE_ATOMIC	= 2,
> > > +};
> > > +
> > > +/** enum xe_pagefault_type - Xe page fault type */
> > > +enum xe_pagefault_type {
> > > +	/** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> > > +	XE_PAGEFAULT_TYPE_NOT_PRESENT		= 0,
> > > +	/** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write access violation */
> > > +	XE_PAGEFAULT_WRITE_ACCESS_VIOLATION	= 1,
> > > +	/** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic access violation */
> > > +	XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION	= 2,
> > > +};
> > > +
> > > +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> > > +struct xe_pagefault_ops {
> > > +	/**
> > > +	 * @ack_fault: Ack fault
> > > +	 * @pf: Page fault
> > > +	 * @err: Error state of fault
> > > +	 *
> > > +	 * Page fault producer receives acknowledgment from the consumer and
> > > +	 * sends the result to the HW/FW interface.
> > > +	 */
> > > +	void (*ack_fault)(struct xe_pagefault *pf, int err);
> > > +};
> > > +
> > > +/**
> > > + * struct xe_pagefault - Xe page fault
> > > + *
> > > + * Generic page fault structure for communication between producer and consumer.
> > > + * Carefully sized to be 64 bytes.
> > > + */
> > > +struct xe_pagefault {
> > > +	/**
> > > +	 * @gt: GT of fault
> > > +	 *
> > > +	 * XXX: We may want to decouple the GT from individual faults, as it's
> > > +	 * unclear whether future platforms will always have a GT for all page
> > > +	 * fault producers. Internally, the GT is used for stats, identifying
> > > +	 * the appropriate VRAM region, and locating the migration queue.
> > > +	 * Leaving this as-is for now, but we can revisit later to see if we
> > > +	 * can convert it to use the Xe device pointer instead.
> > > +	 */
> > > +	struct xe_gt *gt;
> > > +	/**
> > > +	 * @consumer: State for the software handling the fault. Populated by
> > > +	 * the producer and may be modified by the consumer to communicate
> > > +	 * information back to the producer upon fault acknowledgment.
> > > +	 */
> > > +	struct {
> > > +		/** @consumer.page_addr: address of page fault */
> > > +		u64 page_addr;
> > > +		/** @consumer.asid: address space ID */
> > > +		u32 asid;
> > > +		/** @consumer.access_type: access type */
> > > +		u8 access_type;
> > 
> > For type safety we shoud use enum xe_pagefault_access_type instead of u8.
> > 
> > > +		/** @consumer.fault_type: fault type */
> > > +		u8 fault_type;
> > 
> > Same here with enum xe_pagefault_type instead of u8.
> > 
> 
> This structure is carefully sized to 64 bytes. I'm unsure whether using enums
> here would throw off that sizing. I'll check, but I suspect it will; if so,
> I think u8 is the correct choice.
> 
> > > +#define XE_PAGEFAULT_LEVEL_NACK		0xff	/* Producer indicates nack fault */
> > > +		/** @consumer.fault_level: fault level */
> > > +		u8 fault_level;
> > > +		/** @consumer.engine_class: engine class */
> > > +		u8 engine_class;
> > > +		/** consumer.reserved: reserved bits for future expansion */
> > > +		u64 reserved;
> > > +	} consumer;
> > > +	/**
> > > +	 * @producer: State for the producer (i.e., HW/FW interface). Populated
> > > +	 * by the producer and should not be modified—or even inspected—by the
> > > +	 * consumer, except for calling operations.
> > > +	 */
> > > +	struct {
> > > +		/** @producer.private: private pointer */
> > > +		void *private;
> > > +		/** @producer.ops: operations */
> > > +		const struct xe_pagefault_ops *ops;
> > > +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW	4
> > > +		/**
> > > +		 * producer.msg: page fault message, used by producer in fault
> > 
> > s/producer.msg/@producer.msg/
> > 
> 
> +1
> 
> > > +		 * acknowledgement to formulate response to HW/FW interface.
> > 
> > s/acknowledgement/acknowledgment/
> > 
> 
> +1
> 
> > > +		 */
> > > +		u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
> > 
> > It is clear from patch #6 why it is more convenient to have this in
> > struct xe_pagefault rather than local to the producer but this seems
> > to go a bit against the elegant abstraction provided by this new page
> > fault layers. producer.private could store a struct with {guc,msg}
> > but that is probably overengineering so up to you.
> > 
> 
> You can't malloc (with GFP_KERNEL) in the producer path, at least for the
> GuC, as we are either in softIRQ context or under the CT lock, which is in
> the path of reclaim. The only option is to have these fields populated in
> the on-stack xe_pagefault structure in the producer, which is then copied
> into a preallocated page fault queue (sized to sink all possible faults on
> the device) on the consumer side. I can add kernel doc explaining this.

Thanks for the explanation, would be helpful as kernel doc indeed.

Francois

> 
> Matt
> 
> > Francois
> > 
> > > +	} producer;
> > > +};
> > > +
> > > +/** struct xe_pagefault_queue: Xe pagefault queue (consumer) */
> > > +struct xe_pagefault_queue {
> > > +	/**
> > > +	 * @data: Data in queue containing struct xe_pagefault, protected by
> > > +	 * @lock
> > > +	 */
> > > +	void *data;
> > > +	/** @size: Size of queue in bytes */
> > > +	u32 size;
> > > +	/** @head: Head pointer in bytes, moved by producer, protected by @lock */
> > > +	u32 head;
> > > +	/** @tail: Tail pointer in bytes, moved by consumer, protected by @lock */
> > > +	u32 tail;
> > > +	/** @lock: protects page fault queue */
> > > +	spinlock_t lock;
> > > +	/** @worker: to process page faults */
> > > +	struct work_struct worker;
> > > +};
> > > +
> > > +#endif
> > > -- 
> > > 2.34.1
> > > 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/11] drm/xe: Implement xe_pagefault_init
  2025-08-06  6:22 ` [PATCH 02/11] drm/xe: Implement xe_pagefault_init Matthew Brost
  2025-08-06 23:08   ` Summers, Stuart
@ 2025-08-27 16:30   ` Francois Dugast
  2025-08-27 16:49     ` Matthew Brost
  2025-08-28 20:10   ` Summers, Stuart
  2 siblings, 1 reply; 51+ messages in thread
From: Francois Dugast @ 2025-08-27 16:30 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Tue, Aug 05, 2025 at 11:22:33PM -0700, Matthew Brost wrote:
> Create pagefault queues and initialize them.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c       |  5 ++
>  drivers/gpu/drm/xe/xe_device_types.h |  6 ++
>  drivers/gpu/drm/xe/xe_pagefault.c    | 93 +++++++++++++++++++++++++++-
>  3 files changed, 102 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> index 57edbc63da6f..c7c8aee03841 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -50,6 +50,7 @@
>  #include "xe_nvm.h"
>  #include "xe_oa.h"
>  #include "xe_observation.h"
> +#include "xe_pagefault.h"
>  #include "xe_pat.h"
>  #include "xe_pcode.h"
>  #include "xe_pm.h"
> @@ -890,6 +891,10 @@ int xe_device_probe(struct xe_device *xe)
>  	if (err)
>  		return err;
>  
> +	err = xe_pagefault_init(xe);
> +	if (err)
> +		return err;
> +
>  	xe_nvm_init(xe);
>  
>  	err = xe_heci_gsc_init(xe);
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> index 01e8fa0d2f9f..6aa119026ce9 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -17,6 +17,7 @@
>  #include "xe_lmtt_types.h"
>  #include "xe_memirq_types.h"
>  #include "xe_oa_types.h"
> +#include "xe_pagefault_types.h"
>  #include "xe_platform_types.h"
>  #include "xe_pmu_types.h"
>  #include "xe_pt_types.h"
> @@ -394,6 +395,11 @@ struct xe_device {
>  		u32 next_asid;
>  		/** @usm.lock: protects UM state */
>  		struct rw_semaphore lock;
> +		/** @usm.pf_wq: page fault work queue, unbound, high priority */
> +		struct workqueue_struct *pf_wq;
> +#define XE_PAGEFAULT_QUEUE_COUNT	4

Could we add a comment on why the value is 4?

> +		/** @pf_queue: Page fault queues */

s/@pf_queue/@usm.pf_queue/

Rest LGTM.

Francois

> +		struct xe_pagefault_queue pf_queue[XE_PAGEFAULT_QUEUE_COUNT];
>  	} usm;
>  
>  	/** @pinned: pinned BO state */
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> index 3ce0e8d74b9d..14304c41eb23 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -3,6 +3,10 @@
>   * Copyright © 2025 Intel Corporation
>   */
>  
> +#include <drm/drm_managed.h>
> +
> +#include "xe_device.h"
> +#include "xe_gt_types.h"
>  #include "xe_pagefault.h"
>  #include "xe_pagefault_types.h"
>  
> @@ -19,6 +23,71 @@
>   * with a single shared consumer.
>   */
>  
> +static int xe_pagefault_entry_size(void)
> +{
> +	return roundup_pow_of_two(sizeof(struct xe_pagefault));
> +}
> +
> +static void xe_pagefault_queue_work(struct work_struct *w)
> +{
> +	/* TODO: Implement */
> +}
> +
> +static int xe_pagefault_queue_init(struct xe_device *xe,
> +				   struct xe_pagefault_queue *pf_queue)
> +{
> +	struct xe_gt *gt;
> +	int total_num_eus = 0;
> +	u8 id;
> +
> +	for_each_gt(gt, xe, id) {
> +		xe_dss_mask_t all_dss;
> +		int num_dss, num_eus;
> +
> +		bitmap_or(all_dss, gt->fuse_topo.g_dss_mask,
> +			  gt->fuse_topo.c_dss_mask, XE_MAX_DSS_FUSE_BITS);
> +
> +		num_dss = bitmap_weight(all_dss, XE_MAX_DSS_FUSE_BITS);
> +		num_eus = bitmap_weight(gt->fuse_topo.eu_mask_per_dss,
> +					XE_MAX_EU_FUSE_BITS) * num_dss;
> +
> +		total_num_eus += num_eus;
> +	}
> +
> +	xe_assert(xe, total_num_eus);
> +
> +	/*
> +	 * user can issue separate page faults per EU and per CS
> +	 *
> +	 * XXX: Multiplier required as compute UMD are getting PF queue errors
> +	 * without it. Follow on why this multiplier is required.
> +	 */
> +#define PF_MULTIPLIER	8
> +	pf_queue->size = (total_num_eus + XE_NUM_HW_ENGINES) *
> +		xe_pagefault_entry_size() * PF_MULTIPLIER;
> +	pf_queue->size = roundup_pow_of_two(pf_queue->size);
> +#undef PF_MULTIPLIER
> +
> +	drm_dbg(&xe->drm, "xe_pagefault_entry_size=%d, total_num_eus=%d, pf_queue->size=%u",
> +		xe_pagefault_entry_size(), total_num_eus, pf_queue->size);
> +
> +	pf_queue->data = devm_kzalloc(xe->drm.dev, pf_queue->size, GFP_KERNEL);
> +	if (!pf_queue->data)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&pf_queue->lock);
> +	INIT_WORK(&pf_queue->worker, xe_pagefault_queue_work);
> +
> +	return 0;
> +}
> +
> +static void xe_pagefault_fini(void *arg)
> +{
> +	struct xe_device *xe = arg;
> +
> +	destroy_workqueue(xe->usm.pf_wq);
> +}
> +
>  /**
>   * xe_pagefault_init() - Page fault init
>   * @xe: xe device instance
> @@ -29,8 +98,28 @@
>   */
>  int xe_pagefault_init(struct xe_device *xe)
>  {
> -	/* TODO - implement */
> -	return 0;
> +	int err, i;
> +
> +	if (!xe->info.has_usm)
> +		return 0;
> +
> +	xe->usm.pf_wq = alloc_workqueue("xe_page_fault_work_queue",
> +					WQ_UNBOUND | WQ_HIGHPRI,
> +					XE_PAGEFAULT_QUEUE_COUNT);
> +	if (!xe->usm.pf_wq)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i) {
> +		err = xe_pagefault_queue_init(xe, xe->usm.pf_queue + i);
> +		if (err)
> +			goto err_out;
> +	}
> +
> +	return devm_add_action_or_reset(xe->drm.dev, xe_pagefault_fini, xe);
> +
> +err_out:
> +	destroy_workqueue(xe->usm.pf_wq);
> +	return err;
>  }
>  
>  /**
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-27 16:25       ` Francois Dugast
@ 2025-08-27 16:40         ` Matthew Brost
  0 siblings, 0 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-27 16:40 UTC (permalink / raw)
  To: Francois Dugast
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Wed, Aug 27, 2025 at 06:25:49PM +0200, Francois Dugast wrote:
> On Wed, Aug 27, 2025 at 09:03:57AM -0700, Matthew Brost wrote:
> > On Wed, Aug 27, 2025 at 05:29:46PM +0200, Francois Dugast wrote:
> > > On Tue, Aug 05, 2025 at 11:22:32PM -0700, Matthew Brost wrote:
> > > > Stub out the new page fault layer and add kernel documentation. This is
> > > > intended as a replacement for the GT page fault layer, enabling multiple
> > > > producers to hook into a shared page fault consumer interface.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >  drivers/gpu/drm/xe/Makefile             |   1 +
> > > >  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
> > > >  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
> > > >  drivers/gpu/drm/xe/xe_pagefault_types.h | 125 ++++++++++++++++++++++++
> > > >  4 files changed, 208 insertions(+)
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
> > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > > index 8e0c3412a757..6fbebafe79c9 100644
> > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
> > > >  	xe_nvm.o \
> > > >  	xe_oa.o \
> > > >  	xe_observation.o \
> > > > +	xe_pagefault.o \
> > > >  	xe_pat.o \
> > > >  	xe_pci.o \
> > > >  	xe_pcode.o \
> > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> > > > new file mode 100644
> > > > index 000000000000..3ce0e8d74b9d
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > > > @@ -0,0 +1,63 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2025 Intel Corporation
> > > > + */
> > > > +
> > > > +#include "xe_pagefault.h"
> > > > +#include "xe_pagefault_types.h"
> > > > +
> > > > +/**
> > > > + * DOC: Xe page faults
> > > > + *
> > > > + * Xe page faults are handled in two layers. The producer layer interacts with
> > > > + * hardware or firmware to receive and parse faults into struct xe_pagefault,
> > > > + * then forwards them to the consumer. The consumer layer services the faults
> > > > + * (e.g., memory migration, page table updates) and acknowledges the result back
> > > > + * to the producer, which then forwards the results to the hardware or firmware.
> > > > + * The consumer uses a page fault queue sized to absorb all potential faults and
> > > > + * a multi-threaded worker to process them. Multiple producers are supported,
> > > > + * with a single shared consumer.
> > > 
> > > I am not through with the series yet but xe_pagefault seems to be the
> > > consumer code only, while the producer code will be located elsewhere
> > > such as in xe_guc*. If so, might be good to write it here or in the
> > > functions below.
> > > 
> > 
> > I didn't want to mention the GuC specifically as this layer is intended to be
> > generic. The GuC is firmware, and firmware is already called out as a
> > potential producer.
> 
> Sure, sorry for the confusion, I did not mean to name the producer here
> but rather to add something like: "This file contains the consumer code."
> to link the doc and the functions below.
> 

Ah, yes. Will add.

Matt

> > 
> > > > + */
> > > > +
> > > > +/**
> > > > + * xe_pagefault_init() - Page fault init
> > > > + * @xe: xe device instance
> > > > + *
> > > > + * Initialize Xe page fault state. Must be done after reading fuses.
> > > > + *
> > > > + * Return: 0 on Success, errno on failure
> > > > + */
> > > > +int xe_pagefault_init(struct xe_device *xe)
> > > > +{
> > > > +	/* TODO - implement */
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_pagefault_reset() - Page fault reset for a GT
> > > > + * @xe: xe device instance
> > > > + * @gt: GT being reset
> > > > + *
> > > > + * Reset the Xe page fault state for a GT; that is, squash any pending faults on
> > > > + * the GT.
> > > > + */
> > > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> > > > +{
> > > > +	/* TODO - implement */
> > > > +}
> > > > +
> > > > +/**
> > > > + * xe_pagefault_handler() - Page fault handler
> > > > + * @xe: xe device instance
> > > > + * @pf: Page fault
> > > > + *
> > > > + * Sink the page fault to a queue (i.e., a memory buffer) and queue a worker to
> > > > + * service it. Safe to be called from IRQ or process context. Reclaim safe.
> > > > + *
> > > > + * Return: 0 on success, errno on failure
> > > > + */
> > > > +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
> > > > +{
> > > > +	/* TODO - implement */
> > > > +	return 0;
> > > > +}
> > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.h b/drivers/gpu/drm/xe/xe_pagefault.h
> > > > new file mode 100644
> > > > index 000000000000..bd0cdf9ed37f
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> > > > @@ -0,0 +1,19 @@
> > > > +/* SPDX-License-Identifier: MIT */
> > > > +/*
> > > > + * Copyright © 2025 Intel Corporation
> > > > + */
> > > > +
> > > > +#ifndef _XE_PAGEFAULT_H_
> > > > +#define _XE_PAGEFAULT_H_
> > > > +
> > > > +struct xe_device;
> > > > +struct xe_gt;
> > > > +struct xe_pagefault;
> > > > +
> > > > +int xe_pagefault_init(struct xe_device *xe);
> > > > +
> > > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
> > > > +
> > > > +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf);
> > > > +
> > > > +#endif
> > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > new file mode 100644
> > > > index 000000000000..fcff84f93dd8
> > > > --- /dev/null
> > > > +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > @@ -0,0 +1,125 @@
> > > > +/* SPDX-License-Identifier: MIT */
> > > > +/*
> > > > + * Copyright © 2025 Intel Corporation
> > > > + */
> > > > +
> > > > +#ifndef _XE_PAGEFAULT_TYPES_H_
> > > > +#define _XE_PAGEFAULT_TYPES_H_
> > > > +
> > > > +#include <linux/workqueue.h>
> > > > +
> > > > +struct xe_pagefault;
> > > > +struct xe_gt;
> > > > +
> > > > +/** enum xe_pagefault_access_type - Xe page fault access type */
> > > > +enum xe_pagefault_access_type {
> > > > +	/** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type */
> > > > +	XE_PAGEFAULT_ACCESS_TYPE_READ	= 0,
> > > > +	/** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access type */
> > > > +	XE_PAGEFAULT_ACCESS_TYPE_WRITE	= 1,
> > > > +	/** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access type */
> > > > +	XE_PAGEFAULT_ACCESS_TYPE_ATOMIC	= 2,
> > > > +};
> > > > +
> > > > +/** enum xe_pagefault_type - Xe page fault type */
> > > > +enum xe_pagefault_type {
> > > > +	/** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> > > > +	XE_PAGEFAULT_TYPE_NOT_PRESENT		= 0,
> > > > +	/** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write access violation */
> > > > +	XE_PAGEFAULT_WRITE_ACCESS_VIOLATION	= 1,
> > > > +	/** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic access violation */
> > > > +	XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION	= 2,
> > > > +};
> > > > +
> > > > +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> > > > +struct xe_pagefault_ops {
> > > > +	/**
> > > > +	 * @ack_fault: Ack fault
> > > > +	 * @pf: Page fault
> > > > +	 * @err: Error state of fault
> > > > +	 *
> > > > +	 * Page fault producer receives acknowledgment from the consumer and
> > > > +	 * sends the result to the HW/FW interface.
> > > > +	 */
> > > > +	void (*ack_fault)(struct xe_pagefault *pf, int err);
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct xe_pagefault - Xe page fault
> > > > + *
> > > > + * Generic page fault structure for communication between producer and consumer.
> > > > + * Carefully sized to be 64 bytes.
> > > > + */
> > > > +struct xe_pagefault {
> > > > +	/**
> > > > +	 * @gt: GT of fault
> > > > +	 *
> > > > +	 * XXX: We may want to decouple the GT from individual faults, as it's
> > > > +	 * unclear whether future platforms will always have a GT for all page
> > > > +	 * fault producers. Internally, the GT is used for stats, identifying
> > > > +	 * the appropriate VRAM region, and locating the migration queue.
> > > > +	 * Leaving this as-is for now, but we can revisit later to see if we
> > > > +	 * can convert it to use the Xe device pointer instead.
> > > > +	 */
> > > > +	struct xe_gt *gt;
> > > > +	/**
> > > > +	 * @consumer: State for the software handling the fault. Populated by
> > > > +	 * the producer and may be modified by the consumer to communicate
> > > > +	 * information back to the producer upon fault acknowledgment.
> > > > +	 */
> > > > +	struct {
> > > > +		/** @consumer.page_addr: address of page fault */
> > > > +		u64 page_addr;
> > > > +		/** @consumer.asid: address space ID */
> > > > +		u32 asid;
> > > > +		/** @consumer.access_type: access type */
> > > > +		u8 access_type;
> > > 
> > > For type safety we shoud use enum xe_pagefault_access_type instead of u8.
> > > 
> > > > +		/** @consumer.fault_type: fault type */
> > > > +		u8 fault_type;
> > > 
> > > Same here with enum xe_pagefault_type instead of u8.
> > > 
> > 
> > This structure is carefully sized to 64 bytes. I'm unsure whether using enums
> > here would throw off that sizing. I'll check, but I suspect it will; if so,
> > I think u8 is the correct choice.
> > 
> > > > +#define XE_PAGEFAULT_LEVEL_NACK		0xff	/* Producer indicates nack fault */
> > > > +		/** @consumer.fault_level: fault level */
> > > > +		u8 fault_level;
> > > > +		/** @consumer.engine_class: engine class */
> > > > +		u8 engine_class;
> > > > +		/** consumer.reserved: reserved bits for future expansion */
> > > > +		u64 reserved;
> > > > +	} consumer;
> > > > +	/**
> > > > +	 * @producer: State for the producer (i.e., HW/FW interface). Populated
> > > > +	 * by the producer and should not be modified—or even inspected—by the
> > > > +	 * consumer, except for calling operations.
> > > > +	 */
> > > > +	struct {
> > > > +		/** @producer.private: private pointer */
> > > > +		void *private;
> > > > +		/** @producer.ops: operations */
> > > > +		const struct xe_pagefault_ops *ops;
> > > > +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW	4
> > > > +		/**
> > > > +		 * producer.msg: page fault message, used by producer in fault
> > > 
> > > s/producer.msg/@producer.msg/
> > > 
> > 
> > +1
> > 
> > > > +		 * acknowledgement to formulate response to HW/FW interface.
> > > 
> > > s/acknowledgement/acknowledgment/
> > > 
> > 
> > +1
> > 
> > > > +		 */
> > > > +		u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
> > > 
> > > It is clear from patch #6 why it is more convenient to have this in
> > > struct xe_pagefault rather than local to the producer but this seems
> > > to go a bit against the elegant abstraction provided by this new page
> > > fault layers. producer.private could store a struct with {guc,msg}
> > > but that is probably overengineering so up to you.
> > > 
> > 
> > You can't malloc (with GFP_KERNEL) in the producer path, at least for the
> > GuC, as we are either in softIRQ context or under the CT lock, which is in
> > the path of reclaim. The only option is to have these fields populated in
> > the on-stack xe_pagefault structure in the producer, which is then copied
> > into a preallocated page fault queue (sized to sink all possible faults on
> > the device) on the consumer side. I can add kernel doc explaining this.
> 
> Thanks for the explanation, would be helpful as kernel doc indeed.
> 
> Francois
> 
> > 
> > Matt
> > 
> > > Francois
> > > 
> > > > +	} producer;
> > > > +};
> > > > +
> > > > +/** struct xe_pagefault_queue: Xe pagefault queue (consumer) */
> > > > +struct xe_pagefault_queue {
> > > > +	/**
> > > > +	 * @data: Data in queue containing struct xe_pagefault, protected by
> > > > +	 * @lock
> > > > +	 */
> > > > +	void *data;
> > > > +	/** @size: Size of queue in bytes */
> > > > +	u32 size;
> > > > +	/** @head: Head pointer in bytes, moved by producer, protected by @lock */
> > > > +	u32 head;
> > > > +	/** @tail: Tail pointer in bytes, moved by consumer, protected by @lock */
> > > > +	u32 tail;
> > > > +	/** @lock: protects page fault queue */
> > > > +	spinlock_t lock;
> > > > +	/** @worker: to process page faults */
> > > > +	struct work_struct worker;
> > > > +};
> > > > +
> > > > +#endif
> > > > -- 
> > > > 2.34.1
> > > > 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/11] drm/xe: Implement xe_pagefault_init
  2025-08-27 16:30   ` Francois Dugast
@ 2025-08-27 16:49     ` Matthew Brost
  0 siblings, 0 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-27 16:49 UTC (permalink / raw)
  To: Francois Dugast
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Wed, Aug 27, 2025 at 06:30:30PM +0200, Francois Dugast wrote:
> On Tue, Aug 05, 2025 at 11:22:33PM -0700, Matthew Brost wrote:
> > Create pagefault queues and initialize them.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_device.c       |  5 ++
> >  drivers/gpu/drm/xe/xe_device_types.h |  6 ++
> >  drivers/gpu/drm/xe/xe_pagefault.c    | 93 +++++++++++++++++++++++++++-
> >  3 files changed, 102 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index 57edbc63da6f..c7c8aee03841 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -50,6 +50,7 @@
> >  #include "xe_nvm.h"
> >  #include "xe_oa.h"
> >  #include "xe_observation.h"
> > +#include "xe_pagefault.h"
> >  #include "xe_pat.h"
> >  #include "xe_pcode.h"
> >  #include "xe_pm.h"
> > @@ -890,6 +891,10 @@ int xe_device_probe(struct xe_device *xe)
> >  	if (err)
> >  		return err;
> >  
> > +	err = xe_pagefault_init(xe);
> > +	if (err)
> > +		return err;
> > +
> >  	xe_nvm_init(xe);
> >  
> >  	err = xe_heci_gsc_init(xe);
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > index 01e8fa0d2f9f..6aa119026ce9 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -17,6 +17,7 @@
> >  #include "xe_lmtt_types.h"
> >  #include "xe_memirq_types.h"
> >  #include "xe_oa_types.h"
> > +#include "xe_pagefault_types.h"
> >  #include "xe_platform_types.h"
> >  #include "xe_pmu_types.h"
> >  #include "xe_pt_types.h"
> > @@ -394,6 +395,11 @@ struct xe_device {
> >  		u32 next_asid;
> >  		/** @usm.lock: protects UM state */
> >  		struct rw_semaphore lock;
> > +		/** @usm.pf_wq: page fault work queue, unbound, high priority */
> > +		struct workqueue_struct *pf_wq;
> > +#define XE_PAGEFAULT_QUEUE_COUNT	4
> 
> Could we add a comment on why the value is 4?
> 

Sure. 4 was the sweet spot for performance with the current implementation
to get good bandwidth utilization on BMG. Once the THP device pages work
lands, this can be reduced to 2.
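
For context, faults are spread across the queues by ASID (as the handler
patch later in this series does), so the queue count effectively bounds how
many address spaces can be serviced in parallel:

	/* From the consumer side of this series, shown here for illustration */
	struct xe_pagefault_queue *pf_queue = xe->usm.pf_queue +
		(pf->consumer.asid % XE_PAGEFAULT_QUEUE_COUNT);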

Matt

> > +		/** @pf_queue: Page fault queues */
> 
> s/@pf_queue/@usm.pf_queue/
> 
> Rest LGTM.
> 
> Francois
> 
> > +		struct xe_pagefault_queue pf_queue[XE_PAGEFAULT_QUEUE_COUNT];
> >  	} usm;
> >  
> >  	/** @pinned: pinned BO state */
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> > index 3ce0e8d74b9d..14304c41eb23 100644
> > --- a/drivers/gpu/drm/xe/xe_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > @@ -3,6 +3,10 @@
> >   * Copyright © 2025 Intel Corporation
> >   */
> >  
> > +#include <drm/drm_managed.h>
> > +
> > +#include "xe_device.h"
> > +#include "xe_gt_types.h"
> >  #include "xe_pagefault.h"
> >  #include "xe_pagefault_types.h"
> >  
> > @@ -19,6 +23,71 @@
> >   * with a single shared consumer.
> >   */
> >  
> > +static int xe_pagefault_entry_size(void)
> > +{
> > +	return roundup_pow_of_two(sizeof(struct xe_pagefault));
> > +}
> > +
> > +static void xe_pagefault_queue_work(struct work_struct *w)
> > +{
> > +	/* TODO: Implement */
> > +}
> > +
> > +static int xe_pagefault_queue_init(struct xe_device *xe,
> > +				   struct xe_pagefault_queue *pf_queue)
> > +{
> > +	struct xe_gt *gt;
> > +	int total_num_eus = 0;
> > +	u8 id;
> > +
> > +	for_each_gt(gt, xe, id) {
> > +		xe_dss_mask_t all_dss;
> > +		int num_dss, num_eus;
> > +
> > +		bitmap_or(all_dss, gt->fuse_topo.g_dss_mask,
> > +			  gt->fuse_topo.c_dss_mask, XE_MAX_DSS_FUSE_BITS);
> > +
> > +		num_dss = bitmap_weight(all_dss, XE_MAX_DSS_FUSE_BITS);
> > +		num_eus = bitmap_weight(gt->fuse_topo.eu_mask_per_dss,
> > +					XE_MAX_EU_FUSE_BITS) * num_dss;
> > +
> > +		total_num_eus += num_eus;
> > +	}
> > +
> > +	xe_assert(xe, total_num_eus);
> > +
> > +	/*
> > +	 * user can issue separate page faults per EU and per CS
> > +	 *
> > +	 * XXX: Multiplier required as compute UMD are getting PF queue errors
> > +	 * without it. Follow on why this multiplier is required.
> > +	 */
> > +#define PF_MULTIPLIER	8
> > +	pf_queue->size = (total_num_eus + XE_NUM_HW_ENGINES) *
> > +		xe_pagefault_entry_size() * PF_MULTIPLIER;
> > +	pf_queue->size = roundup_pow_of_two(pf_queue->size);
> > +#undef PF_MULTIPLIER
> > +
> > +	drm_dbg(&xe->drm, "xe_pagefault_entry_size=%d, total_num_eus=%d, pf_queue->size=%u",
> > +		xe_pagefault_entry_size(), total_num_eus, pf_queue->size);
> > +
> > +	pf_queue->data = devm_kzalloc(xe->drm.dev, pf_queue->size, GFP_KERNEL);
> > +	if (!pf_queue->data)
> > +		return -ENOMEM;
> > +
> > +	spin_lock_init(&pf_queue->lock);
> > +	INIT_WORK(&pf_queue->worker, xe_pagefault_queue_work);
> > +
> > +	return 0;
> > +}
> > +
> > +static void xe_pagefault_fini(void *arg)
> > +{
> > +	struct xe_device *xe = arg;
> > +
> > +	destroy_workqueue(xe->usm.pf_wq);
> > +}
> > +
> >  /**
> >   * xe_pagefault_init() - Page fault init
> >   * @xe: xe device instance
> > @@ -29,8 +98,28 @@
> >   */
> >  int xe_pagefault_init(struct xe_device *xe)
> >  {
> > -	/* TODO - implement */
> > -	return 0;
> > +	int err, i;
> > +
> > +	if (!xe->info.has_usm)
> > +		return 0;
> > +
> > +	xe->usm.pf_wq = alloc_workqueue("xe_page_fault_work_queue",
> > +					WQ_UNBOUND | WQ_HIGHPRI,
> > +					XE_PAGEFAULT_QUEUE_COUNT);
> > +	if (!xe->usm.pf_wq)
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i) {
> > +		err = xe_pagefault_queue_init(xe, xe->usm.pf_queue + i);
> > +		if (err)
> > +			goto err_out;
> > +	}
> > +
> > +	return devm_add_action_or_reset(xe->drm.dev, xe_pagefault_fini, xe);
> > +
> > +err_out:
> > +	destroy_workqueue(xe->usm.pf_wq);
> > +	return err;
> >  }
> >  
> >  /**
> > -- 
> > 2.34.1
> > 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-27 16:03     ` Matthew Brost
  2025-08-27 16:25       ` Francois Dugast
@ 2025-08-27 18:00       ` Matthew Brost
  1 sibling, 0 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-27 18:00 UTC (permalink / raw)
  To: Francois Dugast
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Wed, Aug 27, 2025 at 09:03:57AM -0700, Matthew Brost wrote:
> On Wed, Aug 27, 2025 at 05:29:46PM +0200, Francois Dugast wrote:
> > On Tue, Aug 05, 2025 at 11:22:32PM -0700, Matthew Brost wrote:
> > > Stub out the new page fault layer and add kernel documentation. This is
> > > intended as a replacement for the GT page fault layer, enabling multiple
> > > producers to hook into a shared page fault consumer interface.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/Makefile             |   1 +
> > >  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
> > >  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
> > >  drivers/gpu/drm/xe/xe_pagefault_types.h | 125 ++++++++++++++++++++++++
> > >  4 files changed, 208 insertions(+)
> > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
> > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
> > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> > > 
> > > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > > index 8e0c3412a757..6fbebafe79c9 100644
> > > --- a/drivers/gpu/drm/xe/Makefile
> > > +++ b/drivers/gpu/drm/xe/Makefile
> > > @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
> > >  	xe_nvm.o \
> > >  	xe_oa.o \
> > >  	xe_observation.o \
> > > +	xe_pagefault.o \
> > >  	xe_pat.o \
> > >  	xe_pci.o \
> > >  	xe_pcode.o \
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> > > new file mode 100644
> > > index 000000000000..3ce0e8d74b9d
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > > @@ -0,0 +1,63 @@
> > > +// SPDX-License-Identifier: MIT
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#include "xe_pagefault.h"
> > > +#include "xe_pagefault_types.h"
> > > +
> > > +/**
> > > + * DOC: Xe page faults
> > > + *
> > > + * Xe page faults are handled in two layers. The producer layer interacts with
> > > + * hardware or firmware to receive and parse faults into struct xe_pagefault,
> > > + * then forwards them to the consumer. The consumer layer services the faults
> > > + * (e.g., memory migration, page table updates) and acknowledges the result back
> > > + * to the producer, which then forwards the results to the hardware or firmware.
> > > + * The consumer uses a page fault queue sized to absorb all potential faults and
> > > + * a multi-threaded worker to process them. Multiple producers are supported,
> > > + * with a single shared consumer.
> > 
> > I am not through with the series yet but xe_pagefault seems to be the
> > consumer code only, while the producer code will be located elsewhere
> > such as in xe_guc*. If so, might be good to write it here or in the
> > functions below.
> > 
> 
> I didn't want to mention the GuC specifically as this layer is intended to be
> generic. The GuC is firmware, and firmware is already called out as a
> potential producer.
> 
> > > + */
> > > +
> > > +/**
> > > + * xe_pagefault_init() - Page fault init
> > > + * @xe: xe device instance
> > > + *
> > > + * Initialize Xe page fault state. Must be done after reading fuses.
> > > + *
> > > + * Return: 0 on Success, errno on failure
> > > + */
> > > +int xe_pagefault_init(struct xe_device *xe)
> > > +{
> > > +	/* TODO - implement */
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * xe_pagefault_reset() - Page fault reset for a GT
> > > + * @xe: xe device instance
> > > + * @gt: GT being reset
> > > + *
> > > + * Reset the Xe page fault state for a GT; that is, squash any pending faults on
> > > + * the GT.
> > > + */
> > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> > > +{
> > > +	/* TODO - implement */
> > > +}
> > > +
> > > +/**
> > > + * xe_pagefault_handler() - Page fault handler
> > > + * @xe: xe device instance
> > > + * @pf: Page fault
> > > + *
> > > + * Sink the page fault to a queue (i.e., a memory buffer) and queue a worker to
> > > + * service it. Safe to be called from IRQ or process context. Reclaim safe.
> > > + *
> > > + * Return: 0 on success, errno on failure
> > > + */
> > > +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
> > > +{
> > > +	/* TODO - implement */
> > > +	return 0;
> > > +}
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.h b/drivers/gpu/drm/xe/xe_pagefault.h
> > > new file mode 100644
> > > index 000000000000..bd0cdf9ed37f
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> > > @@ -0,0 +1,19 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _XE_PAGEFAULT_H_
> > > +#define _XE_PAGEFAULT_H_
> > > +
> > > +struct xe_device;
> > > +struct xe_gt;
> > > +struct xe_pagefault;
> > > +
> > > +int xe_pagefault_init(struct xe_device *xe);
> > > +
> > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
> > > +
> > > +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf);
> > > +
> > > +#endif
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > new file mode 100644
> > > index 000000000000..fcff84f93dd8
> > > --- /dev/null
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > @@ -0,0 +1,125 @@
> > > +/* SPDX-License-Identifier: MIT */
> > > +/*
> > > + * Copyright © 2025 Intel Corporation
> > > + */
> > > +
> > > +#ifndef _XE_PAGEFAULT_TYPES_H_
> > > +#define _XE_PAGEFAULT_TYPES_H_
> > > +
> > > +#include <linux/workqueue.h>
> > > +
> > > +struct xe_pagefault;
> > > +struct xe_gt;
> > > +
> > > +/** enum xe_pagefault_access_type - Xe page fault access type */
> > > +enum xe_pagefault_access_type {
> > > +	/** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type */
> > > +	XE_PAGEFAULT_ACCESS_TYPE_READ	= 0,
> > > +	/** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access type */
> > > +	XE_PAGEFAULT_ACCESS_TYPE_WRITE	= 1,
> > > +	/** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access type */
> > > +	XE_PAGEFAULT_ACCESS_TYPE_ATOMIC	= 2,
> > > +};
> > > +
> > > +/** enum xe_pagefault_type - Xe page fault type */
> > > +enum xe_pagefault_type {
> > > +	/** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> > > +	XE_PAGEFAULT_TYPE_NOT_PRESENT		= 0,
> > > +	/** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write access violation */
> > > +	XE_PAGEFAULT_WRITE_ACCESS_VIOLATION	= 1,
> > > +	/** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic access violation */
> > > +	XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION	= 2,
> > > +};
> > > +
> > > +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> > > +struct xe_pagefault_ops {
> > > +	/**
> > > +	 * @ack_fault: Ack fault
> > > +	 * @pf: Page fault
> > > +	 * @err: Error state of fault
> > > +	 *
> > > +	 * Page fault producer receives acknowledgment from the consumer and
> > > +	 * sends the result to the HW/FW interface.
> > > +	 */
> > > +	void (*ack_fault)(struct xe_pagefault *pf, int err);
> > > +};
> > > +
> > > +/**
> > > + * struct xe_pagefault - Xe page fault
> > > + *
> > > + * Generic page fault structure for communication between producer and consumer.
> > > + * Carefully sized to be 64 bytes.
> > > + */
> > > +struct xe_pagefault {
> > > +	/**
> > > +	 * @gt: GT of fault
> > > +	 *
> > > +	 * XXX: We may want to decouple the GT from individual faults, as it's
> > > +	 * unclear whether future platforms will always have a GT for all page
> > > +	 * fault producers. Internally, the GT is used for stats, identifying
> > > +	 * the appropriate VRAM region, and locating the migration queue.
> > > +	 * Leaving this as-is for now, but we can revisit later to see if we
> > > +	 * can convert it to use the Xe device pointer instead.
> > > +	 */
> > > +	struct xe_gt *gt;
> > > +	/**
> > > +	 * @consumer: State for the software handling the fault. Populated by
> > > +	 * the producer and may be modified by the consumer to communicate
> > > +	 * information back to the producer upon fault acknowledgment.
> > > +	 */
> > > +	struct {
> > > +		/** @consumer.page_addr: address of page fault */
> > > +		u64 page_addr;
> > > +		/** @consumer.asid: address space ID */
> > > +		u32 asid;
> > > +		/** @consumer.access_type: access type */
> > > +		u8 access_type;
> > 
> > For type safety we shoud use enum xe_pagefault_access_type instead of u8.
> > 
> > > +		/** @consumer.fault_type: fault type */
> > > +		u8 fault_type;
> > 
> > Same here with enum xe_pagefault_type instead of u8.
> > 
> 
> This structure is carefully sized to 64 bytes. I'm unsure whether using enums
> here would throw off that sizing. I'll check, but I suspect it will; if so,
> I think u8 is the correct choice.
> 

Using an enum increases the size of the structure. The page fault queue
aligns the entry size to a power of two, so we go from 64 bytes to 128
bytes, increasing the page fault queue size accordingly. I can add a
comment about this too.

Matt

> > > +#define XE_PAGEFAULT_LEVEL_NACK		0xff	/* Producer indicates nack fault */
> > > +		/** @consumer.fault_level: fault level */
> > > +		u8 fault_level;
> > > +		/** @consumer.engine_class: engine class */
> > > +		u8 engine_class;
> > > +		/** consumer.reserved: reserved bits for future expansion */
> > > +		u64 reserved;
> > > +	} consumer;
> > > +	/**
> > > +	 * @producer: State for the producer (i.e., HW/FW interface). Populated
> > > +	 * by the producer and should not be modified—or even inspected—by the
> > > +	 * consumer, except for calling operations.
> > > +	 */
> > > +	struct {
> > > +		/** @producer.private: private pointer */
> > > +		void *private;
> > > +		/** @producer.ops: operations */
> > > +		const struct xe_pagefault_ops *ops;
> > > +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW	4
> > > +		/**
> > > +		 * producer.msg: page fault message, used by producer in fault
> > 
> > s/producer.msg/@producer.msg/
> > 
> 
> +1
> 
> > > +		 * acknowledgement to formulate response to HW/FW interface.
> > 
> > s/acknowledgement/acknowledgment/
> > 
> 
> +1
> 
> > > +		 */
> > > +		u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
> > 
> > It is clear from patch #6 why it is more convenient to have this in
> > struct xe_pagefault rather than local to the producer but this seems
> > to go a bit against the elegant abstraction provided by this new page
> > fault layers. producer.private could store a struct with {guc,msg}
> > but that is probably overengineering so up to you.
> > 
> 
> You can't malloc (with GFP_KERNEL) in the producer path, at least for the
> GuC, as we are either in softIRQ context or under the CT lock, which is in
> the path of reclaim. The only option is to have these fields populated in
> the on-stack xe_pagefault structure in the producer, which is then copied
> into a preallocated page fault queue (sized to sink all possible faults on
> the device) on the consumer side. I can add kernel doc explaining this.
> 
> Matt
> 
> > Francois
> > 
> > > +	} producer;
> > > +};
> > > +
> > > +/** struct xe_pagefault_queue: Xe pagefault queue (consumer) */
> > > +struct xe_pagefault_queue {
> > > +	/**
> > > +	 * @data: Data in queue containing struct xe_pagefault, protected by
> > > +	 * @lock
> > > +	 */
> > > +	void *data;
> > > +	/** @size: Size of queue in bytes */
> > > +	u32 size;
> > > +	/** @head: Head pointer in bytes, moved by producer, protected by @lock */
> > > +	u32 head;
> > > +	/** @tail: Tail pointer in bytes, moved by consumer, protected by @lock */
> > > +	u32 tail;
> > > +	/** @lock: protects page fault queue */
> > > +	spinlock_t lock;
> > > +	/** @worker: to process page faults */
> > > +	struct work_struct worker;
> > > +};
> > > +
> > > +#endif
> > > -- 
> > > 2.34.1
> > > 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 04/11] drm/xe: Implement xe_pagefault_handler
  2025-08-06  6:22 ` [PATCH 04/11] drm/xe: Implement xe_pagefault_handler Matthew Brost
@ 2025-08-28 11:26   ` Francois Dugast
  2025-08-28 20:24   ` Summers, Stuart
  1 sibling, 0 replies; 51+ messages in thread
From: Francois Dugast @ 2025-08-28 11:26 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Tue, Aug 05, 2025 at 11:22:35PM -0700, Matthew Brost wrote:
> Enqueue (copy) the input struct xe_pagefault into a queue (i.e., into a
> memory buffer) and schedule a worker to service it.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>

Reviewed-by: Francois Dugast <francois.dugast@intel.com>

> ---
>  drivers/gpu/drm/xe/xe_pagefault.c | 32 +++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> index aef389e51612..98be3203a9df 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -3,6 +3,8 @@
>   * Copyright © 2025 Intel Corporation
>   */
>  
> +#include <linux/circ_buf.h>
> +
>  #include <drm/drm_managed.h>
>  
>  #include "xe_device.h"
> @@ -156,6 +158,14 @@ void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
>  		xe_pagefault_queue_reset(xe, gt, xe->usm.pf_queue + i);
>  }
>  
> +static bool xe_pagefault_queue_full(struct xe_pagefault_queue *pf_queue)
> +{
> +	lockdep_assert_held(&pf_queue->lock);
> +
> +	return CIRC_SPACE(pf_queue->head, pf_queue->tail, pf_queue->size) <=
> +		xe_pagefault_entry_size();
> +}
> +
>  /**
>   * xe_pagefault_handler() - Page fault handler
>   * @xe: xe device instance
> @@ -168,6 +178,24 @@ void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
>   */
>  int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
>  {
> -	/* TODO - implement */
> -	return 0;
> +	struct xe_pagefault_queue *pf_queue = xe->usm.pf_queue +
> +		(pf->consumer.asid % XE_PAGEFAULT_QUEUE_COUNT);
> +	unsigned long flags;
> +	bool full;
> +
> +	spin_lock_irqsave(&pf_queue->lock, flags);
> +	full = xe_pagefault_queue_full(pf_queue);
> +	if (!full) {
> +		memcpy(pf_queue->data + pf_queue->head, pf, sizeof(*pf));
> +		pf_queue->head = (pf_queue->head + xe_pagefault_entry_size()) %
> +			pf_queue->size;
> +		queue_work(xe->usm.pf_wq, &pf_queue->worker);
> +	} else {
> +		drm_warn(&xe->drm,
> +			 "PageFault Queue (%d) full, shouldn't be possible\n",
> +			 pf->consumer.asid % XE_PAGEFAULT_QUEUE_COUNT);
> +	}
> +	spin_unlock_irqrestore(&pf_queue->lock, flags);
> +
> +	return full ? -ENOSPC : 0;
>  }
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/11] drm/xe: Implement xe_pagefault_queue_work
  2025-08-06  6:22 ` [PATCH 05/11] drm/xe: Implement xe_pagefault_queue_work Matthew Brost
@ 2025-08-28 12:29   ` Francois Dugast
  2025-08-28 18:39     ` Matthew Brost
  2025-08-28 22:04   ` Summers, Stuart
  1 sibling, 1 reply; 51+ messages in thread
From: Francois Dugast @ 2025-08-28 12:29 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Tue, Aug 05, 2025 at 11:22:36PM -0700, Matthew Brost wrote:
> Implement a worker that services page faults, using the same
> implementation as in xe_gt_pagefault.c.

Also the minor refactoring and cleanup along the way helps readability.

A few nits below.
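
One pattern worth calling out before the diff: the worker drains faults only
until a small time budget expires, then re-queues itself so that a long burst
of faults cannot monopolize the workqueue. A stripped-down sketch of that
pattern follows (the example_* names are hypothetical stand-ins, not symbols
from the patch):

    #include <linux/jiffies.h>
    #include <linux/types.h>
    #include <linux/workqueue.h>

    struct example_entry { u32 asid; };              /* hypothetical payload */

    extern struct workqueue_struct *example_wq;      /* hypothetical workqueue */
    bool example_pop(struct example_entry *e);       /* hypothetical dequeue */
    void example_service(struct example_entry *e);   /* hypothetical handler */

    static void example_worker_fn(struct work_struct *w)
    {
            unsigned long threshold = jiffies + msecs_to_jiffies(20);
            struct example_entry e;

            while (example_pop(&e)) {
                    example_service(&e);

                    /* Budget spent: re-queue ourselves and yield. */
                    if (time_after(jiffies, threshold)) {
                            queue_work(example_wq, w);
                            break;
                    }
            }
    }

The 20 ms figure mirrors USM_QUEUE_MAX_RUNTIME_MS below; the exact budget is
a tunable, and the important part is breaking out and re-queueing rather than
looping until the queue is empty.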

> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_pagefault.c | 240 +++++++++++++++++++++++++++++-
>  1 file changed, 239 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> index 98be3203a9df..474412c21ec3 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -5,12 +5,20 @@
>  
>  #include <linux/circ_buf.h>
>  
> +#include <drm/drm_exec.h>
>  #include <drm/drm_managed.h>
>  
> +#include "xe_bo.h"
>  #include "xe_device.h"
> +#include "xe_gt_printk.h"
>  #include "xe_gt_types.h"
> +#include "xe_gt_stats.h"

Move up to maintain alphabetically ordered.

> +#include "xe_hw_engine.h"
>  #include "xe_pagefault.h"
>  #include "xe_pagefault_types.h"
> +#include "xe_svm.h"
> +#include "xe_trace_bo.h"
> +#include "xe_vm.h"
>  
>  /**
>   * DOC: Xe page faults
> @@ -30,9 +38,239 @@ static int xe_pagefault_entry_size(void)
>  	return roundup_pow_of_two(sizeof(struct xe_pagefault));
>  }
>  
> +static int xe_pagefault_begin(struct drm_exec *exec, struct xe_vma *vma,
> +			      bool atomic, unsigned int id)

Please rename id to tile_id for clarity.

Francois

> +{
> +	struct xe_bo *bo = xe_vma_bo(vma);
> +	struct xe_vm *vm = xe_vma_vm(vma);
> +	int err;
> +
> +	err = xe_vm_lock_vma(exec, vma);
> +	if (err)
> +		return err;
> +
> +	if (atomic && IS_DGFX(vm->xe)) {
> +		if (xe_vma_is_userptr(vma)) {
> +			err = -EACCES;
> +			return err;
> +		}
> +
> +		/* Migrate to VRAM, move should invalidate the VMA first */
> +		err = xe_bo_migrate(bo, XE_PL_VRAM0 + id);
> +		if (err)
> +			return err;
> +	} else if (bo) {
> +		/* Create backing store if needed */
> +		err = xe_bo_validate(bo, vm, true);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
> +static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
> +				   bool atomic)
> +{
> +	struct xe_vm *vm = xe_vma_vm(vma);
> +	struct xe_tile *tile = gt_to_tile(gt);
> +	struct drm_exec exec;
> +	struct dma_fence *fence;
> +	ktime_t end = 0;
> +	int err;
> +
> +	lockdep_assert_held_write(&vm->lock);
> +
> +	xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_COUNT, 1);
> +	xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_KB,
> +			 xe_vma_size(vma) / SZ_1K);
> +
> +	trace_xe_vma_pagefault(vma);
> +
> +	/* Check if VMA is valid, opportunistic check only */
> +	if (xe_vm_has_valid_gpu_mapping(tile, vma->tile_present,
> +					vma->tile_invalidated) && !atomic)
> +		return 0;
> +
> +retry_userptr:
> +	if (xe_vma_is_userptr(vma) &&
> +	    xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
> +		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> +
> +		err = xe_vma_userptr_pin_pages(uvma);
> +		if (err)
> +			return err;
> +	}
> +
> +	/* Lock VM and BOs dma-resv */
> +	drm_exec_init(&exec, 0, 0);
> +	drm_exec_until_all_locked(&exec) {
> +		err = xe_pagefault_begin(&exec, vma, atomic, tile->id);
> +		drm_exec_retry_on_contention(&exec);
> +		if (xe_vm_validate_should_retry(&exec, err, &end))
> +			err = -EAGAIN;
> +		if (err)
> +			goto unlock_dma_resv;
> +
> +		/* Bind VMA only to the GT that has faulted */
> +		trace_xe_vma_pf_bind(vma);
> +		fence = xe_vma_rebind(vm, vma, BIT(tile->id));
> +		if (IS_ERR(fence)) {
> +			err = PTR_ERR(fence);
> +			if (xe_vm_validate_should_retry(&exec, err, &end))
> +				err = -EAGAIN;
> +			goto unlock_dma_resv;
> +		}
> +	}
> +
> +	dma_fence_wait(fence, false);
> +	dma_fence_put(fence);
> +
> +unlock_dma_resv:
> +	drm_exec_fini(&exec);
> +	if (err == -EAGAIN)
> +		goto retry_userptr;
> +
> +	return err;
> +}
> +
> +static bool
> +xe_pagefault_access_is_atomic(enum xe_pagefault_access_type access_type)
> +{
> +	return access_type == XE_PAGEFAULT_ACCESS_TYPE_ATOMIC;
> +}
> +
> +static struct xe_vm *xe_pagefault_asid_to_vm(struct xe_device *xe, u32 asid)
> +{
> +	struct xe_vm *vm;
> +
> +	down_read(&xe->usm.lock);
> +	vm = xa_load(&xe->usm.asid_to_vm, asid);
> +	if (vm && xe_vm_in_fault_mode(vm))
> +		xe_vm_get(vm);
> +	else
> +		vm = ERR_PTR(-EINVAL);
> +	up_read(&xe->usm.lock);
> +
> +	return vm;
> +}
> +
> +static int xe_pagefault_service(struct xe_pagefault *pf)
> +{
> +	struct xe_gt *gt = pf->gt;
> +	struct xe_device *xe = gt_to_xe(gt);
> +	struct xe_vm *vm;
> +	struct xe_vma *vma = NULL;
> +	int err;
> +	bool atomic;
> +
> +	/* Producer flagged this fault to be nacked */
> +	if (pf->consumer.fault_level == XE_PAGEFAULT_LEVEL_NACK)
> +		return -EFAULT;
> +
> +	vm = xe_pagefault_asid_to_vm(xe, pf->consumer.asid);
> +	if (IS_ERR(vm))
> +		return PTR_ERR(vm);
> +
> +	/*
> +	 * TODO: Change to read lock? Using write lock for simplicity.
> +	 */
> +	down_write(&vm->lock);
> +
> +	if (xe_vm_is_closed(vm)) {
> +		err = -ENOENT;
> +		goto unlock_vm;
> +	}
> +
> +	vma = xe_vm_find_vma_by_addr(vm, pf->consumer.page_addr);
> +	if (!vma) {
> +		err = -EINVAL;
> +		goto unlock_vm;
> +	}
> +
> +	atomic = xe_pagefault_access_is_atomic(pf->consumer.access_type);
> +
> +	if (xe_vma_is_cpu_addr_mirror(vma))
> +		err = xe_svm_handle_pagefault(vm, vma, gt,
> +					      pf->consumer.page_addr, atomic);
> +	else
> +		err = xe_pagefault_handle_vma(gt, vma, atomic);
> +
> +unlock_vm:
> +	if (!err)
> +		vm->usm.last_fault_vma = vma;
> +	up_write(&vm->lock);
> +	xe_vm_put(vm);
> +
> +	return err;
> +}
> +
> +static bool xe_pagefault_queue_pop(struct xe_pagefault_queue *pf_queue,
> +				   struct xe_pagefault *pf)
> +{
> +	bool found_fault = false;
> +
> +	spin_lock_irq(&pf_queue->lock);
> +	if (pf_queue->tail != pf_queue->head) {
> +		memcpy(pf, pf_queue->data + pf_queue->tail, sizeof(*pf));
> +		pf_queue->tail = (pf_queue->tail + xe_pagefault_entry_size()) %
> +			pf_queue->size;
> +		found_fault = true;
> +	}
> +	spin_unlock_irq(&pf_queue->lock);
> +
> +	return found_fault;
> +}
> +
> +static void xe_pagefault_print(struct xe_pagefault *pf)
> +{
> +	xe_gt_dbg(pf->gt, "\n\tASID: %d\n"
> +		  "\tFaulted Address: 0x%08x%08x\n"
> +		  "\tFaultType: %d\n"
> +		  "\tAccessType: %d\n"
> +		  "\tFaultLevel: %d\n"
> +		  "\tEngineClass: %d %s\n",
> +		  pf->consumer.asid,
> +		  upper_32_bits(pf->consumer.page_addr),
> +		  lower_32_bits(pf->consumer.page_addr),
> +		  pf->consumer.fault_type,
> +		  pf->consumer.access_type,
> +		  pf->consumer.fault_level,
> +		  pf->consumer.engine_class,
> +		  xe_hw_engine_class_to_str(pf->consumer.engine_class));
> +}
> +
>  static void xe_pagefault_queue_work(struct work_struct *w)
>  {
> -	/* TODO: Implement */
> +	struct xe_pagefault_queue *pf_queue =
> +		container_of(w, typeof(*pf_queue), worker);
> +	struct xe_pagefault pf;
> +	unsigned long threshold;
> +
> +#define USM_QUEUE_MAX_RUNTIME_MS      20
> +	threshold = jiffies + msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
> +
> +	while (xe_pagefault_queue_pop(pf_queue, &pf)) {
> +		int err;
> +
> +		if (!pf.gt)	/* Fault squashed during reset */
> +			continue;
> +
> +		err = xe_pagefault_service(&pf);
> +		if (err) {
> +			xe_pagefault_print(&pf);
> +			xe_gt_dbg(pf.gt, "Fault response: Unsuccessful %pe\n",
> +				  ERR_PTR(err));
> +		}
> +
> +		pf.producer.ops->ack_fault(&pf, err);
> +
> +		if (time_after(jiffies, threshold)) {
> +			queue_work(gt_to_xe(pf.gt)->usm.pf_wq, w);
> +			break;
> +		}
> +	}
> +#undef USM_QUEUE_MAX_RUNTIME_MS
>  }
>  
>  static int xe_pagefault_queue_init(struct xe_device *xe,
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/11] drm/xe: Add xe_guc_pagefault layer
  2025-08-06  6:22 ` [PATCH 06/11] drm/xe: Add xe_guc_pagefault layer Matthew Brost
@ 2025-08-28 13:27   ` Francois Dugast
  2025-08-28 18:38     ` Matthew Brost
  2025-08-28 22:11   ` Summers, Stuart
  1 sibling, 1 reply; 51+ messages in thread
From: Francois Dugast @ 2025-08-28 13:27 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Tue, Aug 05, 2025 at 11:22:37PM -0700, Matthew Brost wrote:
> Add xe_guc_pagefault layer (producer) which parses G2H fault messages
> into struct xe_pagefault, forwards them to the page fault layer
> (consumer) for servicing, and provides a vfunc to acknowledge faults to
> the GuC upon completion. Replace the old (and incorrect) GT page fault
> layer with this new layer throughout the driver.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile           |  2 +-
>  drivers/gpu/drm/xe/xe_gt.c            |  6 --
>  drivers/gpu/drm/xe/xe_guc_ct.c        |  6 +-
>  drivers/gpu/drm/xe/xe_guc_pagefault.c | 94 +++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_guc_pagefault.h | 13 ++++
>  drivers/gpu/drm/xe/xe_svm.c           |  3 +-
>  drivers/gpu/drm/xe/xe_vm.c            |  1 -
>  7 files changed, 110 insertions(+), 15 deletions(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.c
>  create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> index 6fbebafe79c9..c103c114b75c 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -58,7 +58,6 @@ xe-y += xe_bb.o \
>  	xe_gt_freq.o \
>  	xe_gt_idle.o \
>  	xe_gt_mcr.o \
> -	xe_gt_pagefault.o \
>  	xe_gt_sysfs.o \
>  	xe_gt_throttle.o \
>  	xe_gt_tlb_invalidation.o \
> @@ -75,6 +74,7 @@ xe-y += xe_bb.o \
>  	xe_guc_id_mgr.o \
>  	xe_guc_klv_helpers.o \
>  	xe_guc_log.o \
> +	xe_guc_pagefault.o \
>  	xe_guc_pc.o \
>  	xe_guc_submit.o \
>  	xe_heci_gsc.o \
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index 5aa03f89a062..35c7ba7828a6 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -32,7 +32,6 @@
>  #include "xe_gt_freq.h"
>  #include "xe_gt_idle.h"
>  #include "xe_gt_mcr.h"
> -#include "xe_gt_pagefault.h"
>  #include "xe_gt_printk.h"
>  #include "xe_gt_sriov_pf.h"
>  #include "xe_gt_sriov_vf.h"
> @@ -634,10 +633,6 @@ int xe_gt_init(struct xe_gt *gt)
>  	if (err)
>  		return err;
>  
> -	err = xe_gt_pagefault_init(gt);
> -	if (err)
> -		return err;
> -
>  	err = xe_gt_idle_init(&gt->gtidle);
>  	if (err)
>  		return err;
> @@ -848,7 +843,6 @@ static int gt_reset(struct xe_gt *gt)
>  	xe_uc_gucrc_disable(&gt->uc);
>  	xe_uc_stop_prepare(&gt->uc);
>  	xe_pagefault_reset(gt_to_xe(gt), gt);
> -	xe_gt_pagefault_reset(gt);
>  
>  	xe_uc_stop(&gt->uc);
>  
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 3f4e6a46ff16..67b5dd182207 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -21,7 +21,6 @@
>  #include "xe_devcoredump.h"
>  #include "xe_device.h"
>  #include "xe_gt.h"
> -#include "xe_gt_pagefault.h"
>  #include "xe_gt_printk.h"
>  #include "xe_gt_sriov_pf_control.h"
>  #include "xe_gt_sriov_pf_monitor.h"
> @@ -29,6 +28,7 @@
>  #include "xe_gt_tlb_invalidation.h"
>  #include "xe_guc.h"
>  #include "xe_guc_log.h"
> +#include "xe_guc_pagefault.h"
>  #include "xe_guc_relay.h"
>  #include "xe_guc_submit.h"
>  #include "xe_map.h"
> @@ -1419,10 +1419,6 @@ static int process_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
>  		ret = xe_guc_tlb_invalidation_done_handler(guc, payload,
>  							   adj_len);
>  		break;
> -	case XE_GUC_ACTION_ACCESS_COUNTER_NOTIFY:
> -		ret = xe_guc_access_counter_notify_handler(guc, payload,
> -							   adj_len);
> -		break;
>  	case XE_GUC_ACTION_GUC2PF_RELAY_FROM_VF:
>  		ret = xe_guc_relay_process_guc2pf(&guc->relay, hxg, hxg_len);
>  		break;
> diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.c b/drivers/gpu/drm/xe/xe_guc_pagefault.c
> new file mode 100644
> index 000000000000..0aa069d2a581
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_guc_pagefault.c
> @@ -0,0 +1,94 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include "abi/guc_actions_abi.h"
> +#include "xe_guc.h"
> +#include "xe_guc_ct.h"
> +#include "xe_guc_pagefault.h"
> +#include "xe_pagefault.h"
> +
> +static void guc_ack_fault(struct xe_pagefault *pf, int err)
> +{
> +	u32 vfid = FIELD_GET(PFD_VFID, pf->producer.msg[2]);
> +	u32 engine_instance = FIELD_GET(PFD_ENG_INSTANCE, pf->producer.msg[0]);
> +	u32 engine_class = FIELD_GET(PFD_ENG_CLASS, pf->producer.msg[0]);
> +	u32 pdata = FIELD_GET(PFD_PDATA_LO, pf->producer.msg[0]) |
> +		(FIELD_GET(PFD_PDATA_HI, pf->producer.msg[1]) <<
> +		 PFD_PDATA_HI_SHIFT);
> +	u32 action[] = {
> +		XE_GUC_ACTION_PAGE_FAULT_RES_DESC,
> +
> +		FIELD_PREP(PFR_VALID, 1) |
> +		FIELD_PREP(PFR_SUCCESS, !!err) |
> +		FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
> +		FIELD_PREP(PFR_DESC_TYPE, FAULT_RESPONSE_DESC) |
> +		FIELD_PREP(PFR_ASID, pf->consumer.asid),
> +
> +		FIELD_PREP(PFR_VFID, vfid) |
> +		FIELD_PREP(PFR_ENG_INSTANCE, engine_instance) |
> +		FIELD_PREP(PFR_ENG_CLASS, engine_class) |
> +		FIELD_PREP(PFR_PDATA, pdata),
> +	};
> +	struct xe_guc *guc = pf->producer.private;
> +
> +	xe_guc_ct_send(&guc->ct, action, ARRAY_SIZE(action), 0, 0);
> +}
> +
> +static const struct xe_pagefault_ops guc_pagefault_ops = {
> +	.ack_fault = guc_ack_fault,
> +};
> +
> +/**
> + * xe_guc_pagefault_handler() - G2H page fault handler
> + * @guc: GuC object
> + * @msg: G2H message
> + * @len: Length of G2H message
> + *
> + * Parse GuC to host (G2H) message into a struct xe_pagefault and forward onto
> + * the Xe page fault layer.
> + *
> + * Return: 0 on success, errno on failure
> + */
> +int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len)
> +{
> +	struct xe_pagefault pf;
> +	int i;
> +
> +#define GUC_PF_MSG_LEN_DW	\
> +	(sizeof(struct xe_guc_pagefault_desc) / sizeof(u32))
> +
> +	BUILD_BUG_ON(GUC_PF_MSG_LEN_DW > XE_PAGEFAULT_PRODUCER_MSG_LEN_DW);
> +
> +	if (len != GUC_PF_MSG_LEN_DW)
> +		return -EPROTO;
> +
> +	pf.gt = guc_to_gt(guc);
> +
> +	/*
> +	 * XXX: These values happen to match the enum in xe_pagefault_types.h.
> +	 * If that changes, we’ll need to remap them here.
> +	 */
> +	pf.consumer.page_addr = (u64)(FIELD_GET(PFD_VIRTUAL_ADDR_HI, msg[3])
> +				      << PFD_VIRTUAL_ADDR_HI_SHIFT) |
> +		(FIELD_GET(PFD_VIRTUAL_ADDR_LO, msg[2]) <<
> +		 PFD_VIRTUAL_ADDR_LO_SHIFT);
> +	pf.consumer.asid = FIELD_GET(PFD_ASID, msg[1]);
> +	pf.consumer.access_type = FIELD_GET(PFD_ACCESS_TYPE, msg[2]);
> +	pf.consumer.fault_type = FIELD_GET(PFD_FAULT_TYPE, msg[2]);
> +	if (FIELD_GET(XE2_PFD_TRVA_FAULT, msg[0]))
> +		pf.consumer.fault_level = XE_PAGEFAULT_LEVEL_NACK;
> +	else
> +		pf.consumer.fault_level = FIELD_GET(PFD_FAULT_LEVEL, msg[0]);
> +	pf.consumer.engine_class = FIELD_GET(PFD_ENG_CLASS, msg[0]);
> +
> +	pf.producer.private = guc;
> +	pf.producer.ops = &guc_pagefault_ops;
> +	for (i = 0; i < GUC_PF_MSG_LEN_DW; ++i)
> +		pf.producer.msg[i] = msg[i];
> +
> +#undef GUC_PF_MSG_LEN_DW
> +
> +	return xe_pagefault_handler(guc_to_xe(guc), &pf);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.h b/drivers/gpu/drm/xe/xe_guc_pagefault.h
> new file mode 100644
> index 000000000000..0723f57b8ea9
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_guc_pagefault.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef _XE_GUC_PAGEFAULT_H_
> +#define _XE_GUC_PAGEFAULT_H_
> +
> +#include <linux/types.h>

For this to compile we are missing:

    struct xe_guc;

Francois

> +
> +int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 10c8a1bcb86e..1bcf3ba3b350 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -109,8 +109,7 @@ xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct xe_svm_range *range,
>  			      &vm->svm.garbage_collector.range_list);
>  	spin_unlock(&vm->svm.garbage_collector.lock);
>  
> -	queue_work(xe_device_get_root_tile(xe)->primary_gt->usm.pf_wq,
> -		   &vm->svm.garbage_collector.work);
> +	queue_work(xe->usm.pf_wq, &vm->svm.garbage_collector.work);
>  }
>  
>  static u8
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 432ea325677d..c9ae13c32117 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -27,7 +27,6 @@
>  #include "xe_device.h"
>  #include "xe_drm_client.h"
>  #include "xe_exec_queue.h"
> -#include "xe_gt_pagefault.h"
>  #include "xe_gt_tlb_invalidation.h"
>  #include "xe_migrate.h"
>  #include "xe_pat.h"
> -- 
> 2.34.1
> 
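
To make the producer/consumer split described in the commit message concrete,
here is a stripped-down sketch of the interface shape: the producer fills in
the fault, points ->ops at its callback table and ->private at its own
context, and the consumer calls ops->ack_fault() once servicing is done. The
example_* names are simplified and are not the real xe_pagefault_types.h
layout:

    #include <linux/types.h>

    struct example_fault;

    /* Producer-supplied callbacks, invoked by the consumer. */
    struct example_fault_ops {
            void (*ack_fault)(struct example_fault *f, int err);
    };

    struct example_fault {
            u64 page_addr;                           /* consumer-facing data */
            const struct example_fault_ops *ops;     /* producer callback table */
            void *private;                           /* producer cookie, e.g. the GuC */
    };

    int example_service(struct example_fault *f);    /* hypothetical handler */

    /* Consumer: service the fault, then hand the result back to the producer. */
    static void example_consume(struct example_fault *f)
    {
            int err = example_service(f);

            f->ops->ack_fault(f, err);
    }

This indirection is what allows more fault producers to be added later without
the consumer backend knowing how each one acknowledges completion.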

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/11] drm/xe: Add xe_guc_pagefault layer
  2025-08-28 13:27   ` Francois Dugast
@ 2025-08-28 18:38     ` Matthew Brost
  0 siblings, 0 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-28 18:38 UTC (permalink / raw)
  To: Francois Dugast
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Thu, Aug 28, 2025 at 03:27:29PM +0200, Francois Dugast wrote:
> On Tue, Aug 05, 2025 at 11:22:37PM -0700, Matthew Brost wrote:
> > Add xe_guc_pagefault layer (producer) which parses G2H fault messages
> > into struct xe_pagefault, forwards them to the page fault layer
> > (consumer) for servicing, and provides a vfunc to acknowledge faults to
> > the GuC upon completion. Replace the old (and incorrect) GT page fault
> > layer with this new layer throughout the driver.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile           |  2 +-
> >  drivers/gpu/drm/xe/xe_gt.c            |  6 --
> >  drivers/gpu/drm/xe/xe_guc_ct.c        |  6 +-
> >  drivers/gpu/drm/xe/xe_guc_pagefault.c | 94 +++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_guc_pagefault.h | 13 ++++
> >  drivers/gpu/drm/xe/xe_svm.c           |  3 +-
> >  drivers/gpu/drm/xe/xe_vm.c            |  1 -
> >  7 files changed, 110 insertions(+), 15 deletions(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Makefile b/drivers/gpu/drm/xe/Makefile
> > index 6fbebafe79c9..c103c114b75c 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -58,7 +58,6 @@ xe-y += xe_bb.o \
> >  	xe_gt_freq.o \
> >  	xe_gt_idle.o \
> >  	xe_gt_mcr.o \
> > -	xe_gt_pagefault.o \
> >  	xe_gt_sysfs.o \
> >  	xe_gt_throttle.o \
> >  	xe_gt_tlb_invalidation.o \
> > @@ -75,6 +74,7 @@ xe-y += xe_bb.o \
> >  	xe_guc_id_mgr.o \
> >  	xe_guc_klv_helpers.o \
> >  	xe_guc_log.o \
> > +	xe_guc_pagefault.o \
> >  	xe_guc_pc.o \
> >  	xe_guc_submit.o \
> >  	xe_heci_gsc.o \
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index 5aa03f89a062..35c7ba7828a6 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -32,7 +32,6 @@
> >  #include "xe_gt_freq.h"
> >  #include "xe_gt_idle.h"
> >  #include "xe_gt_mcr.h"
> > -#include "xe_gt_pagefault.h"
> >  #include "xe_gt_printk.h"
> >  #include "xe_gt_sriov_pf.h"
> >  #include "xe_gt_sriov_vf.h"
> > @@ -634,10 +633,6 @@ int xe_gt_init(struct xe_gt *gt)
> >  	if (err)
> >  		return err;
> >  
> > -	err = xe_gt_pagefault_init(gt);
> > -	if (err)
> > -		return err;
> > -
> >  	err = xe_gt_idle_init(&gt->gtidle);
> >  	if (err)
> >  		return err;
> > @@ -848,7 +843,6 @@ static int gt_reset(struct xe_gt *gt)
> >  	xe_uc_gucrc_disable(&gt->uc);
> >  	xe_uc_stop_prepare(&gt->uc);
> >  	xe_pagefault_reset(gt_to_xe(gt), gt);
> > -	xe_gt_pagefault_reset(gt);
> >  
> >  	xe_uc_stop(&gt->uc);
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > index 3f4e6a46ff16..67b5dd182207 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > @@ -21,7 +21,6 @@
> >  #include "xe_devcoredump.h"
> >  #include "xe_device.h"
> >  #include "xe_gt.h"
> > -#include "xe_gt_pagefault.h"
> >  #include "xe_gt_printk.h"
> >  #include "xe_gt_sriov_pf_control.h"
> >  #include "xe_gt_sriov_pf_monitor.h"
> > @@ -29,6 +28,7 @@
> >  #include "xe_gt_tlb_invalidation.h"
> >  #include "xe_guc.h"
> >  #include "xe_guc_log.h"
> > +#include "xe_guc_pagefault.h"
> >  #include "xe_guc_relay.h"
> >  #include "xe_guc_submit.h"
> >  #include "xe_map.h"
> > @@ -1419,10 +1419,6 @@ static int process_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
> >  		ret = xe_guc_tlb_invalidation_done_handler(guc, payload,
> >  							   adj_len);
> >  		break;
> > -	case XE_GUC_ACTION_ACCESS_COUNTER_NOTIFY:
> > -		ret = xe_guc_access_counter_notify_handler(guc, payload,
> > -							   adj_len);
> > -		break;
> >  	case XE_GUC_ACTION_GUC2PF_RELAY_FROM_VF:
> >  		ret = xe_guc_relay_process_guc2pf(&guc->relay, hxg, hxg_len);
> >  		break;
> > diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.c b/drivers/gpu/drm/xe/xe_guc_pagefault.c
> > new file mode 100644
> > index 000000000000..0aa069d2a581
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_guc_pagefault.c
> > @@ -0,0 +1,94 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2025 Intel Corporation
> > + */
> > +
> > +#include "abi/guc_actions_abi.h"
> > +#include "xe_guc.h"
> > +#include "xe_guc_ct.h"
> > +#include "xe_guc_pagefault.h"
> > +#include "xe_pagefault.h"
> > +
> > +static void guc_ack_fault(struct xe_pagefault *pf, int err)
> > +{
> > +	u32 vfid = FIELD_GET(PFD_VFID, pf->producer.msg[2]);
> > +	u32 engine_instance = FIELD_GET(PFD_ENG_INSTANCE, pf->producer.msg[0]);
> > +	u32 engine_class = FIELD_GET(PFD_ENG_CLASS, pf->producer.msg[0]);
> > +	u32 pdata = FIELD_GET(PFD_PDATA_LO, pf->producer.msg[0]) |
> > +		(FIELD_GET(PFD_PDATA_HI, pf->producer.msg[1]) <<
> > +		 PFD_PDATA_HI_SHIFT);
> > +	u32 action[] = {
> > +		XE_GUC_ACTION_PAGE_FAULT_RES_DESC,
> > +
> > +		FIELD_PREP(PFR_VALID, 1) |
> > +		FIELD_PREP(PFR_SUCCESS, !!err) |
> > +		FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
> > +		FIELD_PREP(PFR_DESC_TYPE, FAULT_RESPONSE_DESC) |
> > +		FIELD_PREP(PFR_ASID, pf->consumer.asid),
> > +
> > +		FIELD_PREP(PFR_VFID, vfid) |
> > +		FIELD_PREP(PFR_ENG_INSTANCE, engine_instance) |
> > +		FIELD_PREP(PFR_ENG_CLASS, engine_class) |
> > +		FIELD_PREP(PFR_PDATA, pdata),
> > +	};
> > +	struct xe_guc *guc = pf->producer.private;
> > +
> > +	xe_guc_ct_send(&guc->ct, action, ARRAY_SIZE(action), 0, 0);
> > +}
> > +
> > +static const struct xe_pagefault_ops guc_pagefault_ops = {
> > +	.ack_fault = guc_ack_fault,
> > +};
> > +
> > +/**
> > + * xe_guc_pagefault_handler() - G2H page fault handler
> > + * @guc: GuC object
> > + * @msg: G2H message
> > + * @len: Length of G2H message
> > + *
> > + * Parse GuC to host (G2H) message into a struct xe_pagefault and forward onto
> > + * the Xe page fault layer.
> > + *
> > + * Return: 0 on success, errno on failure
> > + */
> > +int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len)
> > +{
> > +	struct xe_pagefault pf;
> > +	int i;
> > +
> > +#define GUC_PF_MSG_LEN_DW	\
> > +	(sizeof(struct xe_guc_pagefault_desc) / sizeof(u32))
> > +
> > +	BUILD_BUG_ON(GUC_PF_MSG_LEN_DW > XE_PAGEFAULT_PRODUCER_MSG_LEN_DW);
> > +
> > +	if (len != GUC_PF_MSG_LEN_DW)
> > +		return -EPROTO;
> > +
> > +	pf.gt = guc_to_gt(guc);
> > +
> > +	/*
> > +	 * XXX: These values happen to match the enum in xe_pagefault_types.h.
> > +	 * If that changes, we’ll need to remap them here.
> > +	 */
> > +	pf.consumer.page_addr = (u64)(FIELD_GET(PFD_VIRTUAL_ADDR_HI, msg[3])
> > +				      << PFD_VIRTUAL_ADDR_HI_SHIFT) |
> > +		(FIELD_GET(PFD_VIRTUAL_ADDR_LO, msg[2]) <<
> > +		 PFD_VIRTUAL_ADDR_LO_SHIFT);
> > +	pf.consumer.asid = FIELD_GET(PFD_ASID, msg[1]);
> > +	pf.consumer.access_type = FIELD_GET(PFD_ACCESS_TYPE, msg[2]);
> > +	pf.consumer.fault_type = FIELD_GET(PFD_FAULT_TYPE, msg[2]);
> > +	if (FIELD_GET(XE2_PFD_TRVA_FAULT, msg[0]))
> > +		pf.consumer.fault_level = XE_PAGEFAULT_LEVEL_NACK;
> > +	else
> > +		pf.consumer.fault_level = FIELD_GET(PFD_FAULT_LEVEL, msg[0]);
> > +	pf.consumer.engine_class = FIELD_GET(PFD_ENG_CLASS, msg[0]);
> > +
> > +	pf.producer.private = guc;
> > +	pf.producer.ops = &guc_pagefault_ops;
> > +	for (i = 0; i < GUC_PF_MSG_LEN_DW; ++i)
> > +		pf.producer.msg[i] = msg[i];
> > +
> > +#undef GUC_PF_MSG_LEN_DW
> > +
> > +	return xe_pagefault_handler(guc_to_xe(guc), &pf);
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.h b/drivers/gpu/drm/xe/xe_guc_pagefault.h
> > new file mode 100644
> > index 000000000000..0723f57b8ea9
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_guc_pagefault.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2025 Intel Corporation
> > + */
> > +
> > +#ifndef _XE_GUC_PAGEFAULT_H_
> > +#define _XE_GUC_PAGEFAULT_H_
> > +
> > +#include <linux/types.h>
> 
> For this to compile we are missing:
> 
>     struct xe_guc;
> 

Yes. This was compiling on my machine without it, but something changed
recently and the forward declaration is now required. Will fix.
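
For reference, with the forward declaration added the header would look
roughly like this (everything else unchanged from the patch):

    /* SPDX-License-Identifier: MIT */
    /*
     * Copyright © 2025 Intel Corporation
     */

    #ifndef _XE_GUC_PAGEFAULT_H_
    #define _XE_GUC_PAGEFAULT_H_

    #include <linux/types.h>

    struct xe_guc;

    int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len);

    #endif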

> Francois
> 
> > +
> > +int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len);
> > +
> > +#endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> > index 10c8a1bcb86e..1bcf3ba3b350 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -109,8 +109,7 @@ xe_svm_garbage_collector_add_range(struct xe_vm *vm, struct xe_svm_range *range,
> >  			      &vm->svm.garbage_collector.range_list);
> >  	spin_unlock(&vm->svm.garbage_collector.lock);
> >  
> > -	queue_work(xe_device_get_root_tile(xe)->primary_gt->usm.pf_wq,
> > -		   &vm->svm.garbage_collector.work);
> > +	queue_work(xe->usm.pf_wq, &vm->svm.garbage_collector.work);
> >  }
> >  
> >  static u8
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index 432ea325677d..c9ae13c32117 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -27,7 +27,6 @@
> >  #include "xe_device.h"
> >  #include "xe_drm_client.h"
> >  #include "xe_exec_queue.h"
> > -#include "xe_gt_pagefault.h"
> >  #include "xe_gt_tlb_invalidation.h"
> >  #include "xe_migrate.h"
> >  #include "xe_pat.h"
> > -- 
> > 2.34.1
> > 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/11] drm/xe: Implement xe_pagefault_queue_work
  2025-08-28 12:29   ` Francois Dugast
@ 2025-08-28 18:39     ` Matthew Brost
  0 siblings, 0 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-28 18:39 UTC (permalink / raw)
  To: Francois Dugast
  Cc: intel-xe, thomas.hellstrom, himal.prasad.ghimiray, michal.mrozek

On Thu, Aug 28, 2025 at 02:29:05PM +0200, Francois Dugast wrote:
> On Tue, Aug 05, 2025 at 11:22:36PM -0700, Matthew Brost wrote:
> > Implement a worker that services page faults, using the same
> > implementation as in xe_gt_pagefault.c.
> 
> Also the minor refactoring and cleanup along the way helps readability.
> 
> A few nits below.
> 
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_pagefault.c | 240 +++++++++++++++++++++++++++++-
> >  1 file changed, 239 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> > index 98be3203a9df..474412c21ec3 100644
> > --- a/drivers/gpu/drm/xe/xe_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > @@ -5,12 +5,20 @@
> >  
> >  #include <linux/circ_buf.h>
> >  
> > +#include <drm/drm_exec.h>
> >  #include <drm/drm_managed.h>
> >  
> > +#include "xe_bo.h"
> >  #include "xe_device.h"
> > +#include "xe_gt_printk.h"
> >  #include "xe_gt_types.h"
> > +#include "xe_gt_stats.h"
> 
> Move up to maintain alphabetically ordered.
> 

Yes, will do.

> > +#include "xe_hw_engine.h"
> >  #include "xe_pagefault.h"
> >  #include "xe_pagefault_types.h"
> > +#include "xe_svm.h"
> > +#include "xe_trace_bo.h"
> > +#include "xe_vm.h"
> >  
> >  /**
> >   * DOC: Xe page faults
> > @@ -30,9 +38,239 @@ static int xe_pagefault_entry_size(void)
> >  	return roundup_pow_of_two(sizeof(struct xe_pagefault));
> >  }
> >  
> > +static int xe_pagefault_begin(struct drm_exec *exec, struct xe_vma *vma,
> > +			      bool atomic, unsigned int id)
> 
> Please rename id to tile_id for clarity.
> 

Yep. Will change.

Matt

> Francois
> 
> > +{
> > +	struct xe_bo *bo = xe_vma_bo(vma);
> > +	struct xe_vm *vm = xe_vma_vm(vma);
> > +	int err;
> > +
> > +	err = xe_vm_lock_vma(exec, vma);
> > +	if (err)
> > +		return err;
> > +
> > +	if (atomic && IS_DGFX(vm->xe)) {
> > +		if (xe_vma_is_userptr(vma)) {
> > +			err = -EACCES;
> > +			return err;
> > +		}
> > +
> > +		/* Migrate to VRAM, move should invalidate the VMA first */
> > +		err = xe_bo_migrate(bo, XE_PL_VRAM0 + id);
> > +		if (err)
> > +			return err;
> > +	} else if (bo) {
> > +		/* Create backing store if needed */
> > +		err = xe_bo_validate(bo, vm, true);
> > +		if (err)
> > +			return err;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
> > +				   bool atomic)
> > +{
> > +	struct xe_vm *vm = xe_vma_vm(vma);
> > +	struct xe_tile *tile = gt_to_tile(gt);
> > +	struct drm_exec exec;
> > +	struct dma_fence *fence;
> > +	ktime_t end = 0;
> > +	int err;
> > +
> > +	lockdep_assert_held_write(&vm->lock);
> > +
> > +	xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_COUNT, 1);
> > +	xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_KB,
> > +			 xe_vma_size(vma) / SZ_1K);
> > +
> > +	trace_xe_vma_pagefault(vma);
> > +
> > +	/* Check if VMA is valid, opportunistic check only */
> > +	if (xe_vm_has_valid_gpu_mapping(tile, vma->tile_present,
> > +					vma->tile_invalidated) && !atomic)
> > +		return 0;
> > +
> > +retry_userptr:
> > +	if (xe_vma_is_userptr(vma) &&
> > +	    xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
> > +		struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> > +
> > +		err = xe_vma_userptr_pin_pages(uvma);
> > +		if (err)
> > +			return err;
> > +	}
> > +
> > +	/* Lock VM and BOs dma-resv */
> > +	drm_exec_init(&exec, 0, 0);
> > +	drm_exec_until_all_locked(&exec) {
> > +		err = xe_pagefault_begin(&exec, vma, atomic, tile->id);
> > +		drm_exec_retry_on_contention(&exec);
> > +		if (xe_vm_validate_should_retry(&exec, err, &end))
> > +			err = -EAGAIN;
> > +		if (err)
> > +			goto unlock_dma_resv;
> > +
> > +		/* Bind VMA only to the GT that has faulted */
> > +		trace_xe_vma_pf_bind(vma);
> > +		fence = xe_vma_rebind(vm, vma, BIT(tile->id));
> > +		if (IS_ERR(fence)) {
> > +			err = PTR_ERR(fence);
> > +			if (xe_vm_validate_should_retry(&exec, err, &end))
> > +				err = -EAGAIN;
> > +			goto unlock_dma_resv;
> > +		}
> > +	}
> > +
> > +	dma_fence_wait(fence, false);
> > +	dma_fence_put(fence);
> > +
> > +unlock_dma_resv:
> > +	drm_exec_fini(&exec);
> > +	if (err == -EAGAIN)
> > +		goto retry_userptr;
> > +
> > +	return err;
> > +}
> > +
> > +static bool
> > +xe_pagefault_access_is_atomic(enum xe_pagefault_access_type access_type)
> > +{
> > +	return access_type == XE_PAGEFAULT_ACCESS_TYPE_ATOMIC;
> > +}
> > +
> > +static struct xe_vm *xe_pagefault_asid_to_vm(struct xe_device *xe, u32 asid)
> > +{
> > +	struct xe_vm *vm;
> > +
> > +	down_read(&xe->usm.lock);
> > +	vm = xa_load(&xe->usm.asid_to_vm, asid);
> > +	if (vm && xe_vm_in_fault_mode(vm))
> > +		xe_vm_get(vm);
> > +	else
> > +		vm = ERR_PTR(-EINVAL);
> > +	up_read(&xe->usm.lock);
> > +
> > +	return vm;
> > +}
> > +
> > +static int xe_pagefault_service(struct xe_pagefault *pf)
> > +{
> > +	struct xe_gt *gt = pf->gt;
> > +	struct xe_device *xe = gt_to_xe(gt);
> > +	struct xe_vm *vm;
> > +	struct xe_vma *vma = NULL;
> > +	int err;
> > +	bool atomic;
> > +
> > +	/* Producer flagged this fault to be nacked */
> > +	if (pf->consumer.fault_level == XE_PAGEFAULT_LEVEL_NACK)
> > +		return -EFAULT;
> > +
> > +	vm = xe_pagefault_asid_to_vm(xe, pf->consumer.asid);
> > +	if (IS_ERR(vm))
> > +		return PTR_ERR(vm);
> > +
> > +	/*
> > +	 * TODO: Change to read lock? Using write lock for simplicity.
> > +	 */
> > +	down_write(&vm->lock);
> > +
> > +	if (xe_vm_is_closed(vm)) {
> > +		err = -ENOENT;
> > +		goto unlock_vm;
> > +	}
> > +
> > +	vma = xe_vm_find_vma_by_addr(vm, pf->consumer.page_addr);
> > +	if (!vma) {
> > +		err = -EINVAL;
> > +		goto unlock_vm;
> > +	}
> > +
> > +	atomic = xe_pagefault_access_is_atomic(pf->consumer.access_type);
> > +
> > +	if (xe_vma_is_cpu_addr_mirror(vma))
> > +		err = xe_svm_handle_pagefault(vm, vma, gt,
> > +					      pf->consumer.page_addr, atomic);
> > +	else
> > +		err = xe_pagefault_handle_vma(gt, vma, atomic);
> > +
> > +unlock_vm:
> > +	if (!err)
> > +		vm->usm.last_fault_vma = vma;
> > +	up_write(&vm->lock);
> > +	xe_vm_put(vm);
> > +
> > +	return err;
> > +}
> > +
> > +static bool xe_pagefault_queue_pop(struct xe_pagefault_queue *pf_queue,
> > +				   struct xe_pagefault *pf)
> > +{
> > +	bool found_fault = false;
> > +
> > +	spin_lock_irq(&pf_queue->lock);
> > +	if (pf_queue->tail != pf_queue->head) {
> > +		memcpy(pf, pf_queue->data + pf_queue->tail, sizeof(*pf));
> > +		pf_queue->tail = (pf_queue->tail + xe_pagefault_entry_size()) %
> > +			pf_queue->size;
> > +		found_fault = true;
> > +	}
> > +	spin_unlock_irq(&pf_queue->lock);
> > +
> > +	return found_fault;
> > +}
> > +
> > +static void xe_pagefault_print(struct xe_pagefault *pf)
> > +{
> > +	xe_gt_dbg(pf->gt, "\n\tASID: %d\n"
> > +		  "\tFaulted Address: 0x%08x%08x\n"
> > +		  "\tFaultType: %d\n"
> > +		  "\tAccessType: %d\n"
> > +		  "\tFaultLevel: %d\n"
> > +		  "\tEngineClass: %d %s\n",
> > +		  pf->consumer.asid,
> > +		  upper_32_bits(pf->consumer.page_addr),
> > +		  lower_32_bits(pf->consumer.page_addr),
> > +		  pf->consumer.fault_type,
> > +		  pf->consumer.access_type,
> > +		  pf->consumer.fault_level,
> > +		  pf->consumer.engine_class,
> > +		  xe_hw_engine_class_to_str(pf->consumer.engine_class));
> > +}
> > +
> >  static void xe_pagefault_queue_work(struct work_struct *w)
> >  {
> > -	/* TODO: Implement */
> > +	struct xe_pagefault_queue *pf_queue =
> > +		container_of(w, typeof(*pf_queue), worker);
> > +	struct xe_pagefault pf;
> > +	unsigned long threshold;
> > +
> > +#define USM_QUEUE_MAX_RUNTIME_MS      20
> > +	threshold = jiffies + msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
> > +
> > +	while (xe_pagefault_queue_pop(pf_queue, &pf)) {
> > +		int err;
> > +
> > +		if (!pf.gt)	/* Fault squashed during reset */
> > +			continue;
> > +
> > +		err = xe_pagefault_service(&pf);
> > +		if (err) {
> > +			xe_pagefault_print(&pf);
> > +			xe_gt_dbg(pf.gt, "Fault response: Unsuccessful %pe\n",
> > +				  ERR_PTR(err));
> > +		}
> > +
> > +		pf.producer.ops->ack_fault(&pf, err);
> > +
> > +		if (time_after(jiffies, threshold)) {
> > +			queue_work(gt_to_xe(pf.gt)->usm.pf_wq, w);
> > +			break;
> > +		}
> > +	}
> > +#undef USM_QUEUE_MAX_RUNTIME_MS
> >  }
> >  
> >  static int xe_pagefault_queue_init(struct xe_device *xe,
> > -- 
> > 2.34.1
> > 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 07/11] drm/xe: Remove unused GT page fault code
  2025-08-06  6:22 ` [PATCH 07/11] drm/xe: Remove unused GT page fault code Matthew Brost
@ 2025-08-28 19:13   ` Summers, Stuart
  0 siblings, 0 replies; 51+ messages in thread
From: Summers, Stuart @ 2025-08-28 19:13 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost,  Matthew
  Cc: Mrozek, Michal, Ghimiray, Himal Prasad,
	thomas.hellstrom@linux.intel.com, Dugast, Francois

On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> With the Xe page fault layer and GuC page fault layer in place, this is
> now dead code and can be removed. The ACC (access counter) code is also
> removed; it was already dead code.

Ok so I was planning on posting some UAPI ideas around access counters
here - which of course needs buy-in from the compute UMD. Maybe we can
merge this first and I'll build the reintegration on top of what you
have here. Let me know if you had another plan there...

Thanks,
Stuart

> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_gt_pagefault.c | 691 -------------------------
> --
>  drivers/gpu/drm/xe/xe_gt_pagefault.h |  19 -
>  drivers/gpu/drm/xe/xe_gt_types.h     |  65 ---
>  3 files changed, 775 deletions(-)
>  delete mode 100644 drivers/gpu/drm/xe/xe_gt_pagefault.c
>  delete mode 100644 drivers/gpu/drm/xe/xe_gt_pagefault.h
> 
> diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> deleted file mode 100644
> index ab43dec52776..000000000000
> --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> +++ /dev/null
> @@ -1,691 +0,0 @@
> -// SPDX-License-Identifier: MIT
> -/*
> - * Copyright © 2022 Intel Corporation
> - */
> -
> -#include "xe_gt_pagefault.h"
> -
> -#include <linux/bitfield.h>
> -#include <linux/circ_buf.h>
> -
> -#include <drm/drm_exec.h>
> -#include <drm/drm_managed.h>
> -
> -#include "abi/guc_actions_abi.h"
> -#include "xe_bo.h"
> -#include "xe_gt.h"
> -#include "xe_gt_printk.h"
> -#include "xe_gt_stats.h"
> -#include "xe_gt_tlb_invalidation.h"
> -#include "xe_guc.h"
> -#include "xe_guc_ct.h"
> -#include "xe_migrate.h"
> -#include "xe_svm.h"
> -#include "xe_trace_bo.h"
> -#include "xe_vm.h"
> -#include "xe_vram_types.h"
> -
> -struct pagefault {
> -       u64 page_addr;
> -       u32 asid;
> -       u16 pdata;
> -       u8 vfid;
> -       u8 access_type;
> -       u8 fault_type;
> -       u8 fault_level;
> -       u8 engine_class;
> -       u8 engine_instance;
> -       u8 fault_unsuccessful;
> -       bool trva_fault;
> -};
> -
> -enum access_type {
> -       ACCESS_TYPE_READ = 0,
> -       ACCESS_TYPE_WRITE = 1,
> -       ACCESS_TYPE_ATOMIC = 2,
> -       ACCESS_TYPE_RESERVED = 3,
> -};
> -
> -enum fault_type {
> -       NOT_PRESENT = 0,
> -       WRITE_ACCESS_VIOLATION = 1,
> -       ATOMIC_ACCESS_VIOLATION = 2,
> -};
> -
> -struct acc {
> -       u64 va_range_base;
> -       u32 asid;
> -       u32 sub_granularity;
> -       u8 granularity;
> -       u8 vfid;
> -       u8 access_type;
> -       u8 engine_class;
> -       u8 engine_instance;
> -};
> -
> -static bool access_is_atomic(enum access_type access_type)
> -{
> -       return access_type == ACCESS_TYPE_ATOMIC;
> -}
> -
> -static bool vma_is_valid(struct xe_tile *tile, struct xe_vma *vma)
> -{
> -       return xe_vm_has_valid_gpu_mapping(tile, vma->tile_present,
> -                                          vma->tile_invalidated);
> -}
> -
> -static int xe_pf_begin(struct drm_exec *exec, struct xe_vma *vma,
> -                      bool atomic, struct xe_vram_region *vram)
> -{
> -       struct xe_bo *bo = xe_vma_bo(vma);
> -       struct xe_vm *vm = xe_vma_vm(vma);
> -       int err;
> -
> -       err = xe_vm_lock_vma(exec, vma);
> -       if (err)
> -               return err;
> -
> -       if (atomic && vram) {
> -               xe_assert(vm->xe, IS_DGFX(vm->xe));
> -
> -               if (xe_vma_is_userptr(vma)) {
> -                       err = -EACCES;
> -                       return err;
> -               }
> -
> -               /* Migrate to VRAM, move should invalidate the VMA
> first */
> -               err = xe_bo_migrate(bo, vram->placement);
> -               if (err)
> -                       return err;
> -       } else if (bo) {
> -               /* Create backing store if needed */
> -               err = xe_bo_validate(bo, vm, true);
> -               if (err)
> -                       return err;
> -       }
> -
> -       return 0;
> -}
> -
> -static int handle_vma_pagefault(struct xe_gt *gt, struct xe_vma
> *vma,
> -                               bool atomic)
> -{
> -       struct xe_vm *vm = xe_vma_vm(vma);
> -       struct xe_tile *tile = gt_to_tile(gt);
> -       struct drm_exec exec;
> -       struct dma_fence *fence;
> -       ktime_t end = 0;
> -       int err;
> -
> -       lockdep_assert_held_write(&vm->lock);
> -
> -       xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_COUNT, 1);
> -       xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_KB,
> xe_vma_size(vma) / 1024);
> -
> -       trace_xe_vma_pagefault(vma);
> -
> -       /* Check if VMA is valid, opportunistic check only */
> -       if (vma_is_valid(tile, vma) && !atomic)
> -               return 0;
> -
> -retry_userptr:
> -       if (xe_vma_is_userptr(vma) &&
> -           xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
> -               struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> -
> -               err = xe_vma_userptr_pin_pages(uvma);
> -               if (err)
> -                       return err;
> -       }
> -
> -       /* Lock VM and BOs dma-resv */
> -       drm_exec_init(&exec, 0, 0);
> -       drm_exec_until_all_locked(&exec) {
> -               err = xe_pf_begin(&exec, vma, atomic, tile-
> >mem.vram);
> -               drm_exec_retry_on_contention(&exec);
> -               if (xe_vm_validate_should_retry(&exec, err, &end))
> -                       err = -EAGAIN;
> -               if (err)
> -                       goto unlock_dma_resv;
> -
> -               /* Bind VMA only to the GT that has faulted */
> -               trace_xe_vma_pf_bind(vma);
> -               fence = xe_vma_rebind(vm, vma, BIT(tile->id));
> -               if (IS_ERR(fence)) {
> -                       err = PTR_ERR(fence);
> -                       if (xe_vm_validate_should_retry(&exec, err,
> &end))
> -                               err = -EAGAIN;
> -                       goto unlock_dma_resv;
> -               }
> -       }
> -
> -       dma_fence_wait(fence, false);
> -       dma_fence_put(fence);
> -
> -unlock_dma_resv:
> -       drm_exec_fini(&exec);
> -       if (err == -EAGAIN)
> -               goto retry_userptr;
> -
> -       return err;
> -}
> -
> -static struct xe_vm *asid_to_vm(struct xe_device *xe, u32 asid)
> -{
> -       struct xe_vm *vm;
> -
> -       down_read(&xe->usm.lock);
> -       vm = xa_load(&xe->usm.asid_to_vm, asid);
> -       if (vm && xe_vm_in_fault_mode(vm))
> -               xe_vm_get(vm);
> -       else
> -               vm = ERR_PTR(-EINVAL);
> -       up_read(&xe->usm.lock);
> -
> -       return vm;
> -}
> -
> -static int handle_pagefault(struct xe_gt *gt, struct pagefault *pf)
> -{
> -       struct xe_device *xe = gt_to_xe(gt);
> -       struct xe_vm *vm;
> -       struct xe_vma *vma = NULL;
> -       int err;
> -       bool atomic;
> -
> -       /* SW isn't expected to handle TRTT faults */
> -       if (pf->trva_fault)
> -               return -EFAULT;
> -
> -       vm = asid_to_vm(xe, pf->asid);
> -       if (IS_ERR(vm))
> -               return PTR_ERR(vm);
> -
> -       /*
> -        * TODO: Change to read lock? Using write lock for
> simplicity.
> -        */
> -       down_write(&vm->lock);
> -
> -       if (xe_vm_is_closed(vm)) {
> -               err = -ENOENT;
> -               goto unlock_vm;
> -       }
> -
> -       vma = xe_vm_find_vma_by_addr(vm, pf->page_addr);
> -       if (!vma) {
> -               err = -EINVAL;
> -               goto unlock_vm;
> -       }
> -
> -       atomic = access_is_atomic(pf->access_type);
> -
> -       if (xe_vma_is_cpu_addr_mirror(vma))
> -               err = xe_svm_handle_pagefault(vm, vma, gt,
> -                                             pf->page_addr, atomic);
> -       else
> -               err = handle_vma_pagefault(gt, vma, atomic);
> -
> -unlock_vm:
> -       if (!err)
> -               vm->usm.last_fault_vma = vma;
> -       up_write(&vm->lock);
> -       xe_vm_put(vm);
> -
> -       return err;
> -}
> -
> -static int send_pagefault_reply(struct xe_guc *guc,
> -                               struct xe_guc_pagefault_reply *reply)
> -{
> -       u32 action[] = {
> -               XE_GUC_ACTION_PAGE_FAULT_RES_DESC,
> -               reply->dw0,
> -               reply->dw1,
> -       };
> -
> -       return xe_guc_ct_send(&guc->ct, action, ARRAY_SIZE(action),
> 0, 0);
> -}
> -
> -static void print_pagefault(struct xe_gt *gt, struct pagefault *pf)
> -{
> -       xe_gt_dbg(gt, "\n\tASID: %d\n"
> -                 "\tVFID: %d\n"
> -                 "\tPDATA: 0x%04x\n"
> -                 "\tFaulted Address: 0x%08x%08x\n"
> -                 "\tFaultType: %d\n"
> -                 "\tAccessType: %d\n"
> -                 "\tFaultLevel: %d\n"
> -                 "\tEngineClass: %d %s\n"
> -                 "\tEngineInstance: %d\n",
> -                 pf->asid, pf->vfid, pf->pdata, upper_32_bits(pf-
> >page_addr),
> -                 lower_32_bits(pf->page_addr),
> -                 pf->fault_type, pf->access_type, pf->fault_level,
> -                 pf->engine_class, xe_hw_engine_class_to_str(pf-
> >engine_class),
> -                 pf->engine_instance);
> -}
> -
> -#define PF_MSG_LEN_DW  4
> -
> -static bool get_pagefault(struct pf_queue *pf_queue, struct
> pagefault *pf)
> -{
> -       const struct xe_guc_pagefault_desc *desc;
> -       bool ret = false;
> -
> -       spin_lock_irq(&pf_queue->lock);
> -       if (pf_queue->tail != pf_queue->head) {
> -               desc = (const struct xe_guc_pagefault_desc *)
> -                       (pf_queue->data + pf_queue->tail);
> -
> -               pf->fault_level = FIELD_GET(PFD_FAULT_LEVEL, desc-
> >dw0);
> -               pf->trva_fault = FIELD_GET(XE2_PFD_TRVA_FAULT, desc-
> >dw0);
> -               pf->engine_class = FIELD_GET(PFD_ENG_CLASS, desc-
> >dw0);
> -               pf->engine_instance = FIELD_GET(PFD_ENG_INSTANCE,
> desc->dw0);
> -               pf->pdata = FIELD_GET(PFD_PDATA_HI, desc->dw1) <<
> -                       PFD_PDATA_HI_SHIFT;
> -               pf->pdata |= FIELD_GET(PFD_PDATA_LO, desc->dw0);
> -               pf->asid = FIELD_GET(PFD_ASID, desc->dw1);
> -               pf->vfid = FIELD_GET(PFD_VFID, desc->dw2);
> -               pf->access_type = FIELD_GET(PFD_ACCESS_TYPE, desc-
> >dw2);
> -               pf->fault_type = FIELD_GET(PFD_FAULT_TYPE, desc-
> >dw2);
> -               pf->page_addr = (u64)(FIELD_GET(PFD_VIRTUAL_ADDR_HI,
> desc->dw3)) <<
> -                       PFD_VIRTUAL_ADDR_HI_SHIFT;
> -               pf->page_addr |= FIELD_GET(PFD_VIRTUAL_ADDR_LO, desc-
> >dw2) <<
> -                       PFD_VIRTUAL_ADDR_LO_SHIFT;
> -
> -               pf_queue->tail = (pf_queue->tail + PF_MSG_LEN_DW) %
> -                       pf_queue->num_dw;
> -               ret = true;
> -       }
> -       spin_unlock_irq(&pf_queue->lock);
> -
> -       return ret;
> -}
> -
> -static bool pf_queue_full(struct pf_queue *pf_queue)
> -{
> -       lockdep_assert_held(&pf_queue->lock);
> -
> -       return CIRC_SPACE(pf_queue->head, pf_queue->tail,
> -                         pf_queue->num_dw) <=
> -               PF_MSG_LEN_DW;
> -}
> -
> -int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len)
> -{
> -       struct xe_gt *gt = guc_to_gt(guc);
> -       struct pf_queue *pf_queue;
> -       unsigned long flags;
> -       u32 asid;
> -       bool full;
> -
> -       if (unlikely(len != PF_MSG_LEN_DW))
> -               return -EPROTO;
> -
> -       asid = FIELD_GET(PFD_ASID, msg[1]);
> -       pf_queue = gt->usm.pf_queue + (asid % NUM_PF_QUEUE);
> -
> -       /*
> -        * The below logic doesn't work unless PF_QUEUE_NUM_DW %
> PF_MSG_LEN_DW == 0
> -        */
> -       xe_gt_assert(gt, !(pf_queue->num_dw % PF_MSG_LEN_DW));
> -
> -       spin_lock_irqsave(&pf_queue->lock, flags);
> -       full = pf_queue_full(pf_queue);
> -       if (!full) {
> -               memcpy(pf_queue->data + pf_queue->head, msg, len *
> sizeof(u32));
> -               pf_queue->head = (pf_queue->head + len) %
> -                       pf_queue->num_dw;
> -               queue_work(gt->usm.pf_wq, &pf_queue->worker);
> -       } else {
> -               xe_gt_warn(gt, "PageFault Queue full, shouldn't be
> possible\n");
> -       }
> -       spin_unlock_irqrestore(&pf_queue->lock, flags);
> -
> -       return full ? -ENOSPC : 0;
> -}
> -
> -#define USM_QUEUE_MAX_RUNTIME_MS       20
> -
> -static void pf_queue_work_func(struct work_struct *w)
> -{
> -       struct pf_queue *pf_queue = container_of(w, struct pf_queue,
> worker);
> -       struct xe_gt *gt = pf_queue->gt;
> -       struct xe_guc_pagefault_reply reply = {};
> -       struct pagefault pf = {};
> -       unsigned long threshold;
> -       int ret;
> -
> -       threshold = jiffies +
> msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
> -
> -       while (get_pagefault(pf_queue, &pf)) {
> -               ret = handle_pagefault(gt, &pf);
> -               if (unlikely(ret)) {
> -                       print_pagefault(gt, &pf);
> -                       pf.fault_unsuccessful = 1;
> -                       xe_gt_dbg(gt, "Fault response: Unsuccessful
> %pe\n", ERR_PTR(ret));
> -               }
> -
> -               reply.dw0 = FIELD_PREP(PFR_VALID, 1) |
> -                       FIELD_PREP(PFR_SUCCESS,
> pf.fault_unsuccessful) |
> -                       FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
> -                       FIELD_PREP(PFR_DESC_TYPE,
> FAULT_RESPONSE_DESC) |
> -                       FIELD_PREP(PFR_ASID, pf.asid);
> -
> -               reply.dw1 = FIELD_PREP(PFR_VFID, pf.vfid) |
> -                       FIELD_PREP(PFR_ENG_INSTANCE,
> pf.engine_instance) |
> -                       FIELD_PREP(PFR_ENG_CLASS, pf.engine_class) |
> -                       FIELD_PREP(PFR_PDATA, pf.pdata);
> -
> -               send_pagefault_reply(&gt->uc.guc, &reply);
> -
> -               if (time_after(jiffies, threshold) &&
> -                   pf_queue->tail != pf_queue->head) {
> -                       queue_work(gt->usm.pf_wq, w);
> -                       break;
> -               }
> -       }
> -}
> -
> -static void acc_queue_work_func(struct work_struct *w);
> -
> -static void pagefault_fini(void *arg)
> -{
> -       struct xe_gt *gt = arg;
> -       struct xe_device *xe = gt_to_xe(gt);
> -
> -       if (!xe->info.has_usm)
> -               return;
> -
> -       destroy_workqueue(gt->usm.acc_wq);
> -       destroy_workqueue(gt->usm.pf_wq);
> -}
> -
> -static int xe_alloc_pf_queue(struct xe_gt *gt, struct pf_queue
> *pf_queue)
> -{
> -       struct xe_device *xe = gt_to_xe(gt);
> -       xe_dss_mask_t all_dss;
> -       int num_dss, num_eus;
> -
> -       bitmap_or(all_dss, gt->fuse_topo.g_dss_mask, gt-
> >fuse_topo.c_dss_mask,
> -                 XE_MAX_DSS_FUSE_BITS);
> -
> -       num_dss = bitmap_weight(all_dss, XE_MAX_DSS_FUSE_BITS);
> -       num_eus = bitmap_weight(gt->fuse_topo.eu_mask_per_dss,
> -                               XE_MAX_EU_FUSE_BITS) * num_dss;
> -
> -       /*
> -        * user can issue separate page faults per EU and per CS
> -        *
> -        * XXX: Multiplier required as compute UMD are getting PF
> queue errors
> -        * without it. Follow on why this multiplier is required.
> -        */
> -#define PF_MULTIPLIER  8
> -       pf_queue->num_dw =
> -               (num_eus + XE_NUM_HW_ENGINES) * PF_MSG_LEN_DW *
> PF_MULTIPLIER;
> -       pf_queue->num_dw = roundup_pow_of_two(pf_queue->num_dw);
> -#undef PF_MULTIPLIER
> -
> -       pf_queue->gt = gt;
> -       pf_queue->data = devm_kcalloc(xe->drm.dev, pf_queue->num_dw,
> -                                     sizeof(u32), GFP_KERNEL);
> -       if (!pf_queue->data)
> -               return -ENOMEM;
> -
> -       spin_lock_init(&pf_queue->lock);
> -       INIT_WORK(&pf_queue->worker, pf_queue_work_func);
> -
> -       return 0;
> -}
> -
> -int xe_gt_pagefault_init(struct xe_gt *gt)
> -{
> -       struct xe_device *xe = gt_to_xe(gt);
> -       int i, ret = 0;
> -
> -       if (!xe->info.has_usm)
> -               return 0;
> -
> -       for (i = 0; i < NUM_PF_QUEUE; ++i) {
> -               ret = xe_alloc_pf_queue(gt, &gt->usm.pf_queue[i]);
> -               if (ret)
> -                       return ret;
> -       }
> -       for (i = 0; i < NUM_ACC_QUEUE; ++i) {
> -               gt->usm.acc_queue[i].gt = gt;
> -               spin_lock_init(&gt->usm.acc_queue[i].lock);
> -               INIT_WORK(&gt->usm.acc_queue[i].worker,
> acc_queue_work_func);
> -       }
> -
> -       gt->usm.pf_wq =
> alloc_workqueue("xe_gt_page_fault_work_queue",
> -                                       WQ_UNBOUND | WQ_HIGHPRI,
> NUM_PF_QUEUE);
> -       if (!gt->usm.pf_wq)
> -               return -ENOMEM;
> -
> -       gt->usm.acc_wq =
> alloc_workqueue("xe_gt_access_counter_work_queue",
> -                                        WQ_UNBOUND | WQ_HIGHPRI,
> -                                        NUM_ACC_QUEUE);
> -       if (!gt->usm.acc_wq) {
> -               destroy_workqueue(gt->usm.pf_wq);
> -               return -ENOMEM;
> -       }
> -
> -       return devm_add_action_or_reset(xe->drm.dev, pagefault_fini,
> gt);
> -}
> -
> -void xe_gt_pagefault_reset(struct xe_gt *gt)
> -{
> -       struct xe_device *xe = gt_to_xe(gt);
> -       int i;
> -
> -       if (!xe->info.has_usm)
> -               return;
> -
> -       for (i = 0; i < NUM_PF_QUEUE; ++i) {
> -               spin_lock_irq(&gt->usm.pf_queue[i].lock);
> -               gt->usm.pf_queue[i].head = 0;
> -               gt->usm.pf_queue[i].tail = 0;
> -               spin_unlock_irq(&gt->usm.pf_queue[i].lock);
> -       }
> -
> -       for (i = 0; i < NUM_ACC_QUEUE; ++i) {
> -               spin_lock(&gt->usm.acc_queue[i].lock);
> -               gt->usm.acc_queue[i].head = 0;
> -               gt->usm.acc_queue[i].tail = 0;
> -               spin_unlock(&gt->usm.acc_queue[i].lock);
> -       }
> -}
> -
> -static int granularity_in_byte(int val)
> -{
> -       switch (val) {
> -       case 0:
> -               return SZ_128K;
> -       case 1:
> -               return SZ_2M;
> -       case 2:
> -               return SZ_16M;
> -       case 3:
> -               return SZ_64M;
> -       default:
> -               return 0;
> -       }
> -}
> -
> -static int sub_granularity_in_byte(int val)
> -{
> -       return (granularity_in_byte(val) / 32);
> -}
> -
> -static void print_acc(struct xe_gt *gt, struct acc *acc)
> -{
> -       xe_gt_warn(gt, "Access counter request:\n"
> -                  "\tType: %s\n"
> -                  "\tASID: %d\n"
> -                  "\tVFID: %d\n"
> -                  "\tEngine: %d:%d\n"
> -                  "\tGranularity: 0x%x KB Region/ %d KB sub-
> granularity\n"
> -                  "\tSub_Granularity Vector: 0x%08x\n"
> -                  "\tVA Range base: 0x%016llx\n",
> -                  acc->access_type ? "AC_NTFY_VAL" : "AC_TRIG_VAL",
> -                  acc->asid, acc->vfid, acc->engine_class, acc-
> >engine_instance,
> -                  granularity_in_byte(acc->granularity) / SZ_1K,
> -                  sub_granularity_in_byte(acc->granularity) / SZ_1K,
> -                  acc->sub_granularity, acc->va_range_base);
> -}
> -
> -static struct xe_vma *get_acc_vma(struct xe_vm *vm, struct acc *acc)
> -{
> -       u64 page_va = acc->va_range_base + (ffs(acc->sub_granularity)
> - 1) *
> -               sub_granularity_in_byte(acc->granularity);
> -
> -       return xe_vm_find_overlapping_vma(vm, page_va, SZ_4K);
> -}
> -
> -static int handle_acc(struct xe_gt *gt, struct acc *acc)
> -{
> -       struct xe_device *xe = gt_to_xe(gt);
> -       struct xe_tile *tile = gt_to_tile(gt);
> -       struct drm_exec exec;
> -       struct xe_vm *vm;
> -       struct xe_vma *vma;
> -       int ret = 0;
> -
> -       /* We only support ACC_TRIGGER at the moment */
> -       if (acc->access_type != ACC_TRIGGER)
> -               return -EINVAL;
> -
> -       vm = asid_to_vm(xe, acc->asid);
> -       if (IS_ERR(vm))
> -               return PTR_ERR(vm);
> -
> -       down_read(&vm->lock);
> -
> -       /* Lookup VMA */
> -       vma = get_acc_vma(vm, acc);
> -       if (!vma) {
> -               ret = -EINVAL;
> -               goto unlock_vm;
> -       }
> -
> -       trace_xe_vma_acc(vma);
> -
> -       /* Userptr or null can't be migrated, nothing to do */
> -       if (xe_vma_has_no_bo(vma))
> -               goto unlock_vm;
> -
> -       /* Lock VM and BOs dma-resv */
> -       drm_exec_init(&exec, 0, 0);
> -       drm_exec_until_all_locked(&exec) {
> -               ret = xe_pf_begin(&exec, vma, true, tile->mem.vram);
> -               drm_exec_retry_on_contention(&exec);
> -               if (ret)
> -                       break;
> -       }
> -
> -       drm_exec_fini(&exec);
> -unlock_vm:
> -       up_read(&vm->lock);
> -       xe_vm_put(vm);
> -
> -       return ret;
> -}
> -
> -#define make_u64(hi__, low__)  ((u64)(hi__) << 32 | (u64)(low__))
> -
> -#define ACC_MSG_LEN_DW        4
> -
> -static bool get_acc(struct acc_queue *acc_queue, struct acc *acc)
> -{
> -       const struct xe_guc_acc_desc *desc;
> -       bool ret = false;
> -
> -       spin_lock(&acc_queue->lock);
> -       if (acc_queue->tail != acc_queue->head) {
> -               desc = (const struct xe_guc_acc_desc *)
> -                       (acc_queue->data + acc_queue->tail);
> -
> -               acc->granularity = FIELD_GET(ACC_GRANULARITY, desc-
> >dw2);
> -               acc->sub_granularity = FIELD_GET(ACC_SUBG_HI, desc-
> >dw1) << 31 |
> -                       FIELD_GET(ACC_SUBG_LO, desc->dw0);
> -               acc->engine_class = FIELD_GET(ACC_ENG_CLASS, desc-
> >dw1);
> -               acc->engine_instance = FIELD_GET(ACC_ENG_INSTANCE,
> desc->dw1);
> -               acc->asid =  FIELD_GET(ACC_ASID, desc->dw1);
> -               acc->vfid =  FIELD_GET(ACC_VFID, desc->dw2);
> -               acc->access_type = FIELD_GET(ACC_TYPE, desc->dw0);
> -               acc->va_range_base = make_u64(desc->dw3 &
> ACC_VIRTUAL_ADDR_RANGE_HI,
> -                                             desc->dw2 &
> ACC_VIRTUAL_ADDR_RANGE_LO);
> -
> -               acc_queue->tail = (acc_queue->tail + ACC_MSG_LEN_DW)
> %
> -                                 ACC_QUEUE_NUM_DW;
> -               ret = true;
> -       }
> -       spin_unlock(&acc_queue->lock);
> -
> -       return ret;
> -}
> -
> -static void acc_queue_work_func(struct work_struct *w)
> -{
> -       struct acc_queue *acc_queue = container_of(w, struct
> acc_queue, worker);
> -       struct xe_gt *gt = acc_queue->gt;
> -       struct acc acc = {};
> -       unsigned long threshold;
> -       int ret;
> -
> -       threshold = jiffies +
> msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
> -
> -       while (get_acc(acc_queue, &acc)) {
> -               ret = handle_acc(gt, &acc);
> -               if (unlikely(ret)) {
> -                       print_acc(gt, &acc);
> -                       xe_gt_warn(gt, "ACC: Unsuccessful %pe\n",
> ERR_PTR(ret));
> -               }
> -
> -               if (time_after(jiffies, threshold) &&
> -                   acc_queue->tail != acc_queue->head) {
> -                       queue_work(gt->usm.acc_wq, w);
> -                       break;
> -               }
> -       }
> -}
> -
> -static bool acc_queue_full(struct acc_queue *acc_queue)
> -{
> -       lockdep_assert_held(&acc_queue->lock);
> -
> -       return CIRC_SPACE(acc_queue->head, acc_queue->tail,
> ACC_QUEUE_NUM_DW) <=
> -               ACC_MSG_LEN_DW;
> -}
> -
> -int xe_guc_access_counter_notify_handler(struct xe_guc *guc, u32
> *msg, u32 len)
> -{
> -       struct xe_gt *gt = guc_to_gt(guc);
> -       struct acc_queue *acc_queue;
> -       u32 asid;
> -       bool full;
> -
> -       /*
> -        * The below logic doesn't work unless ACC_QUEUE_NUM_DW %
> ACC_MSG_LEN_DW == 0
> -        */
> -       BUILD_BUG_ON(ACC_QUEUE_NUM_DW % ACC_MSG_LEN_DW);
> -
> -       if (unlikely(len != ACC_MSG_LEN_DW))
> -               return -EPROTO;
> -
> -       asid = FIELD_GET(ACC_ASID, msg[1]);
> -       acc_queue = &gt->usm.acc_queue[asid % NUM_ACC_QUEUE];
> -
> -       spin_lock(&acc_queue->lock);
> -       full = acc_queue_full(acc_queue);
> -       if (!full) {
> -               memcpy(acc_queue->data + acc_queue->head, msg,
> -                      len * sizeof(u32));
> -               acc_queue->head = (acc_queue->head + len) %
> ACC_QUEUE_NUM_DW;
> -               queue_work(gt->usm.acc_wq, &acc_queue->worker);
> -       } else {
> -               xe_gt_warn(gt, "ACC Queue full, dropping ACC\n");
> -       }
> -       spin_unlock(&acc_queue->lock);
> -
> -       return full ? -ENOSPC : 0;
> -}
> diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.h
> b/drivers/gpu/drm/xe/xe_gt_pagefault.h
> deleted file mode 100644
> index 839c065a5e4c..000000000000
> --- a/drivers/gpu/drm/xe/xe_gt_pagefault.h
> +++ /dev/null
> @@ -1,19 +0,0 @@
> -/* SPDX-License-Identifier: MIT */
> -/*
> - * Copyright © 2022 Intel Corporation
> - */
> -
> -#ifndef _XE_GT_PAGEFAULT_H_
> -#define _XE_GT_PAGEFAULT_H_
> -
> -#include <linux/types.h>
> -
> -struct xe_gt;
> -struct xe_guc;
> -
> -int xe_gt_pagefault_init(struct xe_gt *gt);
> -void xe_gt_pagefault_reset(struct xe_gt *gt);
> -int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len);
> -int xe_guc_access_counter_notify_handler(struct xe_guc *guc, u32
> *msg, u32 len);
> -
> -#endif /* _XE_GT_PAGEFAULT_ */
> diff --git a/drivers/gpu/drm/xe/xe_gt_types.h
> b/drivers/gpu/drm/xe/xe_gt_types.h
> index dfd4a16da5f0..48ff0c491ccc 100644
> --- a/drivers/gpu/drm/xe/xe_gt_types.h
> +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> @@ -239,71 +239,6 @@ struct xe_gt {
>                  * operations (e.g. mmigrations, fixing page tables)
>                  */
>                 u16 reserved_bcs_instance;
> -               /** @usm.pf_wq: page fault work queue, unbound, high
> priority */
> -               struct workqueue_struct *pf_wq;
> -               /** @usm.acc_wq: access counter work queue, unbound,
> high priority */
> -               struct workqueue_struct *acc_wq;
> -               /**
> -                * @usm.pf_queue: Page fault queue used to sync
> faults so faults can
> -                * be processed not under the GuC CT lock. The queue
> is sized so
> -                * it can sync all possible faults (1 per physical
> engine).
> -                * Multiple queues exists for page faults from
> different VMs are
> -                * be processed in parallel.
> -                */
> -               struct pf_queue {
> -                       /** @usm.pf_queue.gt: back pointer to GT */
> -                       struct xe_gt *gt;
> -                       /** @usm.pf_queue.data: data in the page
> fault queue */
> -                       u32 *data;
> -                       /**
> -                        * @usm.pf_queue.num_dw: number of DWORDS in
> the page
> -                        * fault queue. Dynamically calculated based
> on the number
> -                        * of compute resources available.
> -                        */
> -                       u32 num_dw;
> -                       /**
> -                        * @usm.pf_queue.tail: tail pointer in DWs
> for page fault queue,
> -                        * moved by worker which processes faults
> (consumer).
> -                        */
> -                       u16 tail;
> -                       /**
> -                        * @usm.pf_queue.head: head pointer in DWs
> for page fault queue,
> -                        * moved by G2H handler (producer).
> -                        */
> -                       u16 head;
> -                       /** @usm.pf_queue.lock: protects page fault
> queue */
> -                       spinlock_t lock;
> -                       /** @usm.pf_queue.worker: to process page
> faults */
> -                       struct work_struct worker;
> -#define NUM_PF_QUEUE   4
> -               } pf_queue[NUM_PF_QUEUE];
> -               /**
> -                * @usm.acc_queue: Same as page fault queue, cannot
> process access
> -                * counters under CT lock.
> -                */
> -               struct acc_queue {
> -                       /** @usm.acc_queue.gt: back pointer to GT */
> -                       struct xe_gt *gt;
> -#define ACC_QUEUE_NUM_DW       128
> -                       /** @usm.acc_queue.data: data in the page
> fault queue */
> -                       u32 data[ACC_QUEUE_NUM_DW];
> -                       /**
> -                        * @usm.acc_queue.tail: tail pointer in DWs
> for access counter queue,
> -                        * moved by worker which processes counters
> -                        * (consumer).
> -                        */
> -                       u16 tail;
> -                       /**
> -                        * @usm.acc_queue.head: head pointer in DWs
> for access counter queue,
> -                        * moved by G2H handler (producer).
> -                        */
> -                       u16 head;
> -                       /** @usm.acc_queue.lock: protects page fault
> queue */
> -                       spinlock_t lock;
> -                       /** @usm.acc_queue.worker: to process access
> counters */
> -                       struct work_struct worker;
> -#define NUM_ACC_QUEUE  4
> -               } acc_queue[NUM_ACC_QUEUE];
>         } usm;
>  
>         /** @ordered_wq: used to serialize GT resets and TDRs */


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-06  6:22 ` [PATCH 01/11] drm/xe: Stub out new pagefault layer Matthew Brost
  2025-08-06 23:01   ` Summers, Stuart
  2025-08-27 15:29   ` Francois Dugast
@ 2025-08-28 20:08   ` Summers, Stuart
  2 siblings, 0 replies; 51+ messages in thread
From: Summers, Stuart @ 2025-08-28 20:08 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost,  Matthew
  Cc: Mrozek, Michal, Ghimiray, Himal Prasad,
	thomas.hellstrom@linux.intel.com, Dugast, Francois

On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> Stub out the new page fault layer and add kernel documentation. This
> is
> intended as a replacement for the GT page fault layer, enabling
> multiple
> producers to hook into a shared page fault consumer interface.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile             |   1 +
>  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
>  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
>  drivers/gpu/drm/xe/xe_pagefault_types.h | 125
> ++++++++++++++++++++++++
>  4 files changed, 208 insertions(+)
>  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
>  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
>  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index 8e0c3412a757..6fbebafe79c9 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
>         xe_nvm.o \
>         xe_oa.o \
>         xe_observation.o \
> +       xe_pagefault.o \
>         xe_pat.o \
>         xe_pci.o \
>         xe_pcode.o \
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> b/drivers/gpu/drm/xe/xe_pagefault.c
> new file mode 100644
> index 000000000000..3ce0e8d74b9d
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -0,0 +1,63 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include "xe_pagefault.h"
> +#include "xe_pagefault_types.h"
> +
> +/**
> + * DOC: Xe page faults
> + *
> + * Xe page faults are handled in two layers. The producer layer
> interacts with
> + * hardware or firmware to receive and parse faults into struct
> xe_pagefault,
> + * then forwards them to the consumer. The consumer layer services
> the faults
> + * (e.g., memory migration, page table updates) and acknowledges the
> result back
> + * to the producer, which then forwards the results to the hardware
> or firmware.
> + * The consumer uses a page fault queue sized to absorb all
> potential faults and
> + * a multi-threaded worker to process them. Multiple producers are
> supported,
> + * with a single shared consumer.
> + */
> +
> +/**
> + * xe_pagefault_init() - Page fault init
> + * @xe: xe device instance
> + *
> + * Initialize Xe page fault state. Must be done after reading fuses.
> + *
> + * Return: 0 on Success, errno on failure
> + */
> +int xe_pagefault_init(struct xe_device *xe)
> +{
> +       /* TODO - implement */
> +       return 0;
> +}
> +
> +/**
> + * xe_pagefault_reset() - Page fault reset for a GT
> + * @xe: xe device instance
> + * @gt: GT being reset
> + *
> + * Reset the Xe page fault state for a GT; that is, squash any
> pending faults on
> + * the GT.
> + */
> +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> +{
> +       /* TODO - implement */
> +}
> +
> +/**
> + * xe_pagefault_handler() - Page fault handler
> + * @xe: xe device instance
> + * @pf: Page fault
> + *
> + * Sink the page fault to a queue (i.e., a memory buffer) and queue
> a worker to
> + * service it. Safe to be called from IRQ or process context.
> Reclaim safe.
> + *
> + * Return: 0 on success, errno on failure
> + */
> +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault
> *pf)
> +{
> +       /* TODO - implement */
> +       return 0;
> +}
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.h
> b/drivers/gpu/drm/xe/xe_pagefault.h
> new file mode 100644
> index 000000000000..bd0cdf9ed37f
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef _XE_PAGEFAULT_H_
> +#define _XE_PAGEFAULT_H_
> +
> +struct xe_device;
> +struct xe_gt;
> +struct xe_pagefault;
> +
> +int xe_pagefault_init(struct xe_device *xe);
> +
> +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
> +
> +int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault
> *pf);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h
> b/drivers/gpu/drm/xe/xe_pagefault_types.h
> new file mode 100644
> index 000000000000..fcff84f93dd8
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> @@ -0,0 +1,125 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef _XE_PAGEFAULT_TYPES_H_
> +#define _XE_PAGEFAULT_TYPES_H_
> +
> +#include <linux/workqueue.h>
> +
> +struct xe_pagefault;
> +struct xe_gt;
> +
> +/** enum xe_pagefault_access_type - Xe page fault access type */
> +enum xe_pagefault_access_type {
> +       /** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type */
> +       XE_PAGEFAULT_ACCESS_TYPE_READ   = 0,
> +       /** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access type */
> +       XE_PAGEFAULT_ACCESS_TYPE_WRITE  = 1,
> +       /** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access type */
> +       XE_PAGEFAULT_ACCESS_TYPE_ATOMIC = 2,
> +};
> +
> +/** enum xe_pagefault_type - Xe page fault type */
> +enum xe_pagefault_type {
> +       /** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> +       XE_PAGEFAULT_TYPE_NOT_PRESENT           = 0,
> +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write access
> violation */
> +       XE_PAGEFAULT_WRITE_ACCESS_VIOLATION     = 1,
> +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic access
> violation */
> +       XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION    = 2,
> +};
> +
> +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> +struct xe_pagefault_ops {
> +       /**
> +        * @ack_fault: Ack fault
> +        * @pf: Page fault
> +        * @err: Error state of fault
> +        *
> +        * Page fault producer receives acknowledgment from the
> consumer and
> +        * sends the result to the HW/FW interface.
> +        */
> +       void (*ack_fault)(struct xe_pagefault *pf, int err);
> +};
> +
> +/**
> + * struct xe_pagefault - Xe page fault
> + *
> + * Generic page fault structure for communication between producer
> and consumer.
> + * Carefully sized to be 64 bytes.

Can you expand this slightly to:
Carefully sized to be 64 bytes to align with hardware requests

Since this also came up later in the series in the enum discussion.

> + */
> +struct xe_pagefault {
> +       /**
> +        * @gt: GT of fault
> +        *
> +        * XXX: We may want to decouple the GT from individual
> faults, as it's
> +        * unclear whether future platforms will always have a GT for
> all page
> +        * fault producers. Internally, the GT is used for stats,
> identifying
> +        * the appropriate VRAM region, and locating the migration
> queue.
> +        * Leaving this as-is for now, but we can revisit later to
> see if we
> +        * can convert it to use the Xe device pointer instead.
> +        */
> +       struct xe_gt *gt;
> +       /**
> +        * @consumer: State for the software handling the fault.
> Populated by
> +        * the producer and may be modified by the consumer to
> communicate
> +        * information back to the producer upon fault
> acknowledgment.
> +        */
> +       struct {
> +               /** @consumer.page_addr: address of page fault */
> +               u64 page_addr;
> +               /** @consumer.asid: address space ID */
> +               u32 asid;
> +               /** @consumer.access_type: access type */
> +               u8 access_type;
> +               /** @consumer.fault_type: fault type */
> +               u8 fault_type;
> +#define XE_PAGEFAULT_LEVEL_NACK                0xff    /* Producer
> indicates nack fault */
> +               /** @consumer.fault_level: fault level */
> +               u8 fault_level;
> +               /** @consumer.engine_class: engine class */
> +               u8 engine_class;
> +               /** consumer.reserved: reserved bits for future
> expansion */
> +               u64 reserved;
> +       } consumer;

Do we want this packed?
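
Either way, a compile-time check could pin the intended layout down. A
rough sketch, assuming the 64-byte target stays the goal (static_assert()
from <linux/build_bug.h>; not part of the posted patch):

/* Sketch only: enforce the intended 64-byte layout at build time. */
static_assert(sizeof(struct xe_pagefault) == 64,
              "struct xe_pagefault is expected to be 64 bytes");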

Thanks,
Stuart

> +       /**
> +        * @producer: State for the producer (i.e., HW/FW interface).
> Populated
> +        * by the producer and should not be modified—or even
> inspected—by the
> +        * consumer, except for calling operations.
> +        */
> +       struct {
> +               /** @producer.private: private pointer */
> +               void *private;
> +               /** @producer.ops: operations */
> +               const struct xe_pagefault_ops *ops;
> +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW       4
> +               /**
> +                * producer.msg: page fault message, used by producer
> in fault
> +                * acknowledgement to formulate response to HW/FW
> interface.
> +                */
> +               u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
> +       } producer;
> +};
> +
> +/** struct xe_pagefault_queue: Xe pagefault queue (consumer) */
> +struct xe_pagefault_queue {
> +       /**
> +        * @data: Data in queue containing struct xe_pagefault,
> protected by
> +        * @lock
> +        */
> +       void *data;
> +       /** @size: Size of queue in bytes */
> +       u32 size;
> +       /** @head: Head pointer in bytes, moved by producer,
> protected by @lock */
> +       u32 head;
> +       /** @tail: Tail pointer in bytes, moved by consumer,
> protected by @lock */
> +       u32 tail;
> +       /** @lock: protects page fault queue */
> +       spinlock_t lock;
> +       /** @worker: to process page faults */
> +       struct work_struct worker;
> +};
> +
> +#endif


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/11] drm/xe: Implement xe_pagefault_init
  2025-08-06  6:22 ` [PATCH 02/11] drm/xe: Implement xe_pagefault_init Matthew Brost
  2025-08-06 23:08   ` Summers, Stuart
  2025-08-27 16:30   ` Francois Dugast
@ 2025-08-28 20:10   ` Summers, Stuart
  2025-08-28 20:14     ` Matthew Brost
  2 siblings, 1 reply; 51+ messages in thread
From: Summers, Stuart @ 2025-08-28 20:10 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost,  Matthew
  Cc: Mrozek, Michal, Ghimiray, Himal Prasad,
	thomas.hellstrom@linux.intel.com, Dugast, Francois

On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> Create pagefault queues and initialize them.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c       |  5 ++
>  drivers/gpu/drm/xe/xe_device_types.h |  6 ++
>  drivers/gpu/drm/xe/xe_pagefault.c    | 93
> +++++++++++++++++++++++++++-
>  3 files changed, 102 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c
> b/drivers/gpu/drm/xe/xe_device.c
> index 57edbc63da6f..c7c8aee03841 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -50,6 +50,7 @@
>  #include "xe_nvm.h"
>  #include "xe_oa.h"
>  #include "xe_observation.h"
> +#include "xe_pagefault.h"
>  #include "xe_pat.h"
>  #include "xe_pcode.h"
>  #include "xe_pm.h"
> @@ -890,6 +891,10 @@ int xe_device_probe(struct xe_device *xe)
>         if (err)
>                 return err;
>  
> +       err = xe_pagefault_init(xe);
> +       if (err)
> +               return err;
> +
>         xe_nvm_init(xe);
>  
>         err = xe_heci_gsc_init(xe);
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> b/drivers/gpu/drm/xe/xe_device_types.h
> index 01e8fa0d2f9f..6aa119026ce9 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -17,6 +17,7 @@
>  #include "xe_lmtt_types.h"
>  #include "xe_memirq_types.h"
>  #include "xe_oa_types.h"
> +#include "xe_pagefault_types.h"
>  #include "xe_platform_types.h"
>  #include "xe_pmu_types.h"
>  #include "xe_pt_types.h"
> @@ -394,6 +395,11 @@ struct xe_device {
>                 u32 next_asid;
>                 /** @usm.lock: protects UM state */
>                 struct rw_semaphore lock;
> +               /** @usm.pf_wq: page fault work queue, unbound, high
> priority */
> +               struct workqueue_struct *pf_wq;
> +#define XE_PAGEFAULT_QUEUE_COUNT       4
> +               /** @pf_queue: Page fault queues */
> +               struct xe_pagefault_queue
> pf_queue[XE_PAGEFAULT_QUEUE_COUNT];
>         } usm;
>  
>         /** @pinned: pinned BO state */
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> b/drivers/gpu/drm/xe/xe_pagefault.c
> index 3ce0e8d74b9d..14304c41eb23 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -3,6 +3,10 @@
>   * Copyright © 2025 Intel Corporation
>   */
>  
> +#include <drm/drm_managed.h>
> +
> +#include "xe_device.h"
> +#include "xe_gt_types.h"
>  #include "xe_pagefault.h"
>  #include "xe_pagefault_types.h"
>  
> @@ -19,6 +23,71 @@
>   * with a single shared consumer.
>   */
>  
> +static int xe_pagefault_entry_size(void)
> +{
> +       return roundup_pow_of_two(sizeof(struct xe_pagefault));

And here, it would be nice if you could add a brief comment that this
assumes the size of struct xe_pagefault aligns to the hardware
requirements.

Thanks,
Stuart

> +}
> +
> +static void xe_pagefault_queue_work(struct work_struct *w)
> +{
> +       /* TODO: Implement */
> +}
> +
> +static int xe_pagefault_queue_init(struct xe_device *xe,
> +                                  struct xe_pagefault_queue
> *pf_queue)
> +{
> +       struct xe_gt *gt;
> +       int total_num_eus = 0;
> +       u8 id;
> +
> +       for_each_gt(gt, xe, id) {
> +               xe_dss_mask_t all_dss;
> +               int num_dss, num_eus;
> +
> +               bitmap_or(all_dss, gt->fuse_topo.g_dss_mask,
> +                         gt->fuse_topo.c_dss_mask,
> XE_MAX_DSS_FUSE_BITS);
> +
> +               num_dss = bitmap_weight(all_dss,
> XE_MAX_DSS_FUSE_BITS);
> +               num_eus = bitmap_weight(gt-
> >fuse_topo.eu_mask_per_dss,
> +                                       XE_MAX_EU_FUSE_BITS) *
> num_dss;
> +
> +               total_num_eus += num_eus;
> +       }
> +
> +       xe_assert(xe, total_num_eus);
> +
> +       /*
> +        * user can issue separate page faults per EU and per CS
> +        *
> +        * XXX: Multiplier required as compute UMD are getting PF
> queue errors
> +        * without it. Follow on why this multiplier is required.
> +        */
> +#define PF_MULTIPLIER  8
> +       pf_queue->size = (total_num_eus + XE_NUM_HW_ENGINES) *
> +               xe_pagefault_entry_size() * PF_MULTIPLIER;
> +       pf_queue->size = roundup_pow_of_two(pf_queue->size);
> +#undef PF_MULTIPLIER
> +
> +       drm_dbg(&xe->drm, "xe_pagefault_entry_size=%d,
> total_num_eus=%d, pf_queue->size=%u",
> +               xe_pagefault_entry_size(), total_num_eus, pf_queue-
> >size);
> +
> +       pf_queue->data = devm_kzalloc(xe->drm.dev, pf_queue->size,
> GFP_KERNEL);
> +       if (!pf_queue->data)
> +               return -ENOMEM;
> +
> +       spin_lock_init(&pf_queue->lock);
> +       INIT_WORK(&pf_queue->worker, xe_pagefault_queue_work);
> +
> +       return 0;
> +}
> +
> +static void xe_pagefault_fini(void *arg)
> +{
> +       struct xe_device *xe = arg;
> +
> +       destroy_workqueue(xe->usm.pf_wq);
> +}
> +
>  /**
>   * xe_pagefault_init() - Page fault init
>   * @xe: xe device instance
> @@ -29,8 +98,28 @@
>   */
>  int xe_pagefault_init(struct xe_device *xe)
>  {
> -       /* TODO - implement */
> -       return 0;
> +       int err, i;
> +
> +       if (!xe->info.has_usm)
> +               return 0;
> +
> +       xe->usm.pf_wq = alloc_workqueue("xe_page_fault_work_queue",
> +                                       WQ_UNBOUND | WQ_HIGHPRI,
> +                                       XE_PAGEFAULT_QUEUE_COUNT);
> +       if (!xe->usm.pf_wq)
> +               return -ENOMEM;
> +
> +       for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i) {
> +               err = xe_pagefault_queue_init(xe, xe->usm.pf_queue +
> i);
> +               if (err)
> +                       goto err_out;
> +       }
> +
> +       return devm_add_action_or_reset(xe->drm.dev,
> xe_pagefault_fini, xe);
> +
> +err_out:
> +       destroy_workqueue(xe->usm.pf_wq);
> +       return err;
>  }
>  
>  /**


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/11] drm/xe: Implement xe_pagefault_init
  2025-08-28 20:10   ` Summers, Stuart
@ 2025-08-28 20:14     ` Matthew Brost
  2025-08-28 20:19       ` Summers, Stuart
  0 siblings, 1 reply; 51+ messages in thread
From: Matthew Brost @ 2025-08-28 20:14 UTC (permalink / raw)
  To: Summers, Stuart
  Cc: intel-xe@lists.freedesktop.org, Mrozek, Michal,
	Ghimiray, Himal Prasad, thomas.hellstrom@linux.intel.com,
	Dugast, Francois

On Thu, Aug 28, 2025 at 02:10:02PM -0600, Summers, Stuart wrote:
> On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > Create pagefault queues and initialize them.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_device.c       |  5 ++
> >  drivers/gpu/drm/xe/xe_device_types.h |  6 ++
> >  drivers/gpu/drm/xe/xe_pagefault.c    | 93
> > +++++++++++++++++++++++++++-
> >  3 files changed, 102 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > b/drivers/gpu/drm/xe/xe_device.c
> > index 57edbc63da6f..c7c8aee03841 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -50,6 +50,7 @@
> >  #include "xe_nvm.h"
> >  #include "xe_oa.h"
> >  #include "xe_observation.h"
> > +#include "xe_pagefault.h"
> >  #include "xe_pat.h"
> >  #include "xe_pcode.h"
> >  #include "xe_pm.h"
> > @@ -890,6 +891,10 @@ int xe_device_probe(struct xe_device *xe)
> >         if (err)
> >                 return err;
> >  
> > +       err = xe_pagefault_init(xe);
> > +       if (err)
> > +               return err;
> > +
> >         xe_nvm_init(xe);
> >  
> >         err = xe_heci_gsc_init(xe);
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > b/drivers/gpu/drm/xe/xe_device_types.h
> > index 01e8fa0d2f9f..6aa119026ce9 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -17,6 +17,7 @@
> >  #include "xe_lmtt_types.h"
> >  #include "xe_memirq_types.h"
> >  #include "xe_oa_types.h"
> > +#include "xe_pagefault_types.h"
> >  #include "xe_platform_types.h"
> >  #include "xe_pmu_types.h"
> >  #include "xe_pt_types.h"
> > @@ -394,6 +395,11 @@ struct xe_device {
> >                 u32 next_asid;
> >                 /** @usm.lock: protects UM state */
> >                 struct rw_semaphore lock;
> > +               /** @usm.pf_wq: page fault work queue, unbound, high
> > priority */
> > +               struct workqueue_struct *pf_wq;
> > +#define XE_PAGEFAULT_QUEUE_COUNT       4
> > +               /** @pf_queue: Page fault queues */
> > +               struct xe_pagefault_queue
> > pf_queue[XE_PAGEFAULT_QUEUE_COUNT];
> >         } usm;
> >  
> >         /** @pinned: pinned BO state */
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > b/drivers/gpu/drm/xe/xe_pagefault.c
> > index 3ce0e8d74b9d..14304c41eb23 100644
> > --- a/drivers/gpu/drm/xe/xe_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > @@ -3,6 +3,10 @@
> >   * Copyright © 2025 Intel Corporation
> >   */
> >  
> > +#include <drm/drm_managed.h>
> > +
> > +#include "xe_device.h"
> > +#include "xe_gt_types.h"
> >  #include "xe_pagefault.h"
> >  #include "xe_pagefault_types.h"
> >  
> > @@ -19,6 +23,71 @@
> >   * with a single shared consumer.
> >   */
> >  
> > +static int xe_pagefault_entry_size(void)
> > +{
> > +       return roundup_pow_of_two(sizeof(struct xe_pagefault));
> 
> And here, it would be nice if you could add a brief comment that this
> assumes the size of struct xe_pagefault aligns to the hardware
> requirements.
> 

It is actually not a hardware thing; it's the pagefault queue management
code (software) where the logic breaks if we are not doing everything on
pow2 boundaries. Of course, this isn't a strict requirement; rather, it
just makes the code simpler.

I can add a comment around this.
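
To illustrate the kind of simplification I mean: with power-of-two sizes,
wrapping a queue position is a plain mask. A minimal sketch, with the
helper name assumed (not part of the series):

/*
 * Sketch only: assumes pf_queue->size and the entry size are both
 * powers of two, so advancing head/tail wraps with a mask instead of
 * a modulo and an entry never straddles the end of the buffer.
 */
static u32 xe_pagefault_queue_next(struct xe_pagefault_queue *pf_queue,
                                   u32 pos)
{
        return (pos + xe_pagefault_entry_size()) & (pf_queue->size - 1);
}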

Matt

> Thanks,
> Stuart
> 
> > +}
> > +
> > +static void xe_pagefault_queue_work(struct work_struct *w)
> > +{
> > +       /* TODO: Implement */
> > +}
> > +
> > +static int xe_pagefault_queue_init(struct xe_device *xe,
> > +                                  struct xe_pagefault_queue
> > *pf_queue)
> > +{
> > +       struct xe_gt *gt;
> > +       int total_num_eus = 0;
> > +       u8 id;
> > +
> > +       for_each_gt(gt, xe, id) {
> > +               xe_dss_mask_t all_dss;
> > +               int num_dss, num_eus;
> > +
> > +               bitmap_or(all_dss, gt->fuse_topo.g_dss_mask,
> > +                         gt->fuse_topo.c_dss_mask,
> > XE_MAX_DSS_FUSE_BITS);
> > +
> > +               num_dss = bitmap_weight(all_dss,
> > XE_MAX_DSS_FUSE_BITS);
> > +               num_eus = bitmap_weight(gt-
> > >fuse_topo.eu_mask_per_dss,
> > +                                       XE_MAX_EU_FUSE_BITS) *
> > num_dss;
> > +
> > +               total_num_eus += num_eus;
> > +       }
> > +
> > +       xe_assert(xe, total_num_eus);
> > +
> > +       /*
> > +        * user can issue separate page faults per EU and per CS
> > +        *
> > +        * XXX: Multiplier required as compute UMD are getting PF
> > queue errors
> > +        * without it. Follow on why this multiplier is required.
> > +        */
> > +#define PF_MULTIPLIER  8
> > +       pf_queue->size = (total_num_eus + XE_NUM_HW_ENGINES) *
> > +               xe_pagefault_entry_size() * PF_MULTIPLIER;
> > +       pf_queue->size = roundup_pow_of_two(pf_queue->size);
> > +#undef PF_MULTIPLIER
> > +
> > +       drm_dbg(&xe->drm, "xe_pagefault_entry_size=%d,
> > total_num_eus=%d, pf_queue->size=%u",
> > +               xe_pagefault_entry_size(), total_num_eus, pf_queue-
> > >size);
> > +
> > +       pf_queue->data = devm_kzalloc(xe->drm.dev, pf_queue->size,
> > GFP_KERNEL);
> > +       if (!pf_queue->data)
> > +               return -ENOMEM;
> > +
> > +       spin_lock_init(&pf_queue->lock);
> > +       INIT_WORK(&pf_queue->worker, xe_pagefault_queue_work);
> > +
> > +       return 0;
> > +}
> > +
> > +static void xe_pagefault_fini(void *arg)
> > +{
> > +       struct xe_device *xe = arg;
> > +
> > +       destroy_workqueue(xe->usm.pf_wq);
> > +}
> > +
> >  /**
> >   * xe_pagefault_init() - Page fault init
> >   * @xe: xe device instance
> > @@ -29,8 +98,28 @@
> >   */
> >  int xe_pagefault_init(struct xe_device *xe)
> >  {
> > -       /* TODO - implement */
> > -       return 0;
> > +       int err, i;
> > +
> > +       if (!xe->info.has_usm)
> > +               return 0;
> > +
> > +       xe->usm.pf_wq = alloc_workqueue("xe_page_fault_work_queue",
> > +                                       WQ_UNBOUND | WQ_HIGHPRI,
> > +                                       XE_PAGEFAULT_QUEUE_COUNT);
> > +       if (!xe->usm.pf_wq)
> > +               return -ENOMEM;
> > +
> > +       for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i) {
> > +               err = xe_pagefault_queue_init(xe, xe->usm.pf_queue +
> > i);
> > +               if (err)
> > +                       goto err_out;
> > +       }
> > +
> > +       return devm_add_action_or_reset(xe->drm.dev,
> > xe_pagefault_fini, xe);
> > +
> > +err_out:
> > +       destroy_workqueue(xe->usm.pf_wq);
> > +       return err;
> >  }
> >  
> >  /**
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-07 18:10         ` Matthew Brost
@ 2025-08-28 20:18           ` Summers, Stuart
  2025-08-28 20:20             ` Matthew Brost
  0 siblings, 1 reply; 51+ messages in thread
From: Summers, Stuart @ 2025-08-28 20:18 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, thomas.hellstrom@linux.intel.com,
	Ghimiray, Himal Prasad, Dugast, Francois, Mrozek, Michal

On Thu, 2025-08-07 at 11:10 -0700, Matthew Brost wrote:
> On Thu, Aug 07, 2025 at 11:20:06AM -0600, Summers, Stuart wrote:
> > On Wed, 2025-08-06 at 16:53 -0700, Matthew Brost wrote:
> > > On Wed, Aug 06, 2025 at 05:01:12PM -0600, Summers, Stuart wrote:
> > > > Few basic comments below to start. I personally would rather
> > > > this
> > > > be
> > > > brought over from the existing fault handler rather than
> > > > creating
> > > > something entirely new and then clobbering the older stuff -
> > > > just
> > > > so
> > > > the review of message format requests/replies is easier to
> > > > review
> > > > and
> > > > where we're deviating from the existing external interfaces
> > > > (HW/FW/GuC/etc). You already have this here though so not a
> > > > huge
> > > > deal.
> > > > I think most of this was in the giant blob of patches that got
> > > > merged
> > > > with the initial driver, so I guess the counter argument is we
> > > > can
> > > > have
> > > > easy to reference historical reviews now.
> > > > 
> > > 
> > > Yes, page fault code is largely just a big blob from the original
> > > Xe patch that wasn't the most well-thought-out code. We still have
> > > that history in the tree, just git blame won't work, so you'd need
> > > to know where to look if you want that.
> > > 
> > > I don't think there is a great way to pull this over, unless
> > > patches
> > > 2-7
> > > are squashed into a single patch + a couple of 'git mv' are used.
> > 
> > No definitely don't think that's worth it. Let's just review as-is.
> > 
> > > 
> > > > On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > > > > Stub out the new page fault layer and add kernel
> > > > > documentation.
> > > > > This
> > > > > is
> > > > > intended as a replacement for the GT page fault layer,
> > > > > enabling
> > > > > multiple
> > > > > producers to hook into a shared page fault consumer
> > > > > interface.
> > > > > 
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/xe/Makefile             |   1 +
> > > > >  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
> > > > >  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
> > > > >  drivers/gpu/drm/xe/xe_pagefault_types.h | 125
> > > > > ++++++++++++++++++++++++
> > > > >  4 files changed, 208 insertions(+)
> > > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
> > > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
> > > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > > b/drivers/gpu/drm/xe/Makefile
> > > > > index 8e0c3412a757..6fbebafe79c9 100644
> > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
> > > > >         xe_nvm.o \
> > > > >         xe_oa.o \
> > > > >         xe_observation.o \
> > > > > +       xe_pagefault.o \
> > > > >         xe_pat.o \
> > > > >         xe_pci.o \
> > > > >         xe_pcode.o \
> > > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > > > > b/drivers/gpu/drm/xe/xe_pagefault.c
> > > > > new file mode 100644
> > > > > index 000000000000..3ce0e8d74b9d
> > > > > --- /dev/null
> > > > > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > > > > @@ -0,0 +1,63 @@
> > > > > +// SPDX-License-Identifier: MIT
> > > > > +/*
> > > > > + * Copyright © 2025 Intel Corporation
> > > > > + */
> > > > > +
> > > > > +#include "xe_pagefault.h"
> > > > > +#include "xe_pagefault_types.h"
> > > > > +
> > > > > +/**
> > > > > + * DOC: Xe page faults
> > > > > + *
> > > > > + * Xe page faults are handled in two layers. The producer
> > > > > layer
> > > > > interacts with
> > > > > + * hardware or firmware to receive and parse faults into
> > > > > struct
> > > > > xe_pagefault,
> > > > > + * then forwards them to the consumer. The consumer layer
> > > > > services
> > > > > the faults
> > > > > + * (e.g., memory migration, page table updates) and
> > > > > acknowledges
> > > > > the
> > > > > result back
> > > > > + * to the producer, which then forwards the results to the
> > > > > hardware
> > > > > or firmware.
> > > > > + * The consumer uses a page fault queue sized to absorb all
> > > > > potential faults and
> > > > > + * a multi-threaded worker to process them. Multiple
> > > > > producers
> > > > > are
> > > > > supported,
> > > > > + * with a single shared consumer.
> > > > > + */
> > > > > +
> > > > > +/**
> > > > > + * xe_pagefault_init() - Page fault init
> > > > > + * @xe: xe device instance
> > > > > + *
> > > > > + * Initialize Xe page fault state. Must be done after
> > > > > reading
> > > > > fuses.
> > > > > + *
> > > > > + * Return: 0 on Success, errno on failure
> > > > > + */
> > > > > +int xe_pagefault_init(struct xe_device *xe)
> > > > > +{
> > > > > +       /* TODO - implement */
> > > > > +       return 0;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * xe_pagefault_reset() - Page fault reset for a GT
> > > > > + * @xe: xe device instance
> > > > > + * @gt: GT being reset
> > > > > + *
> > > > > + * Reset the Xe page fault state for a GT; that is, squash
> > > > > any
> > > > > pending faults on
> > > > > + * the GT.
> > > > > + */
> > > > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt
> > > > > *gt)
> > > > > +{
> > > > > +       /* TODO - implement */
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * xe_pagefault_handler() - Page fault handler
> > > > > + * @xe: xe device instance
> > > > > + * @pf: Page fault
> > > > > + *
> > > > > + * Sink the page fault to a queue (i.e., a memory buffer)
> > > > > and
> > > > > queue
> > > > > a worker to
> > > > > + * service it. Safe to be called from IRQ or process
> > > > > context.
> > > > > Reclaim safe.
> > > > > + *
> > > > > + * Return: 0 on success, errno on failure
> > > > > + */
> > > > > +int xe_pagefault_handler(struct xe_device *xe, struct
> > > > > xe_pagefault
> > > > > *pf)
> > > > > +{
> > > > > +       /* TODO - implement */
> > > > > +       return 0;
> > > > > +}
> > > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.h
> > > > > b/drivers/gpu/drm/xe/xe_pagefault.h
> > > > > new file mode 100644
> > > > > index 000000000000..bd0cdf9ed37f
> > > > > --- /dev/null
> > > > > +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> > > > > @@ -0,0 +1,19 @@
> > > > > +/* SPDX-License-Identifier: MIT */
> > > > > +/*
> > > > > + * Copyright © 2025 Intel Corporation
> > > > > + */
> > > > > +
> > > > > +#ifndef _XE_PAGEFAULT_H_
> > > > > +#define _XE_PAGEFAULT_H_
> > > > > +
> > > > > +struct xe_device;
> > > > > +struct xe_gt;
> > > > > +struct xe_pagefault;
> > > > > +
> > > > > +int xe_pagefault_init(struct xe_device *xe);
> > > > > +
> > > > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt
> > > > > *gt);
> > > > > +
> > > > > +int xe_pagefault_handler(struct xe_device *xe, struct
> > > > > xe_pagefault
> > > > > *pf);
> > > > > +
> > > > > +#endif
> > > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > > b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > > new file mode 100644
> > > > > index 000000000000..fcff84f93dd8
> > > > > --- /dev/null
> > > > > +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > > @@ -0,0 +1,125 @@
> > > > > +/* SPDX-License-Identifier: MIT */
> > > > > +/*
> > > > > + * Copyright © 2025 Intel Corporation
> > > > > + */
> > > > > +
> > > > > +#ifndef _XE_PAGEFAULT_TYPES_H_
> > > > > +#define _XE_PAGEFAULT_TYPES_H_
> > > > > +
> > > > > +#include <linux/workqueue.h>
> > > > > +
> > > > > +struct xe_pagefault;
> > > > > +struct xe_gt;
> > > > 
> > > > Nit: Maybe reverse these structs to be in alphabetical order
> > > > 
> > > 
> > > Yes, that is the preferred style. Will fix.
> > > 
> > > > > +
> > > > > +/** enum xe_pagefault_access_type - Xe page fault access
> > > > > type */
> > > > > +enum xe_pagefault_access_type {
> > > > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type
> > > > > */
> > > > > +       XE_PAGEFAULT_ACCESS_TYPE_READ   = 0,
> > > > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access
> > > > > type */
> > > > > +       XE_PAGEFAULT_ACCESS_TYPE_WRITE  = 1,
> > > > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access
> > > > > type
> > > > > */
> > > > > +       XE_PAGEFAULT_ACCESS_TYPE_ATOMIC = 2,
> > > > > +};
> > > > > +
> > > > > +/** enum xe_pagefault_type - Xe page fault type */
> > > > > +enum xe_pagefault_type {
> > > > > +       /** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> > > > > +       XE_PAGEFAULT_TYPE_NOT_PRESENT           = 0,
> > > > > +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write
> > > > > access
> > > > > violation */
> > > > > +       XE_PAGEFAULT_WRITE_ACCESS_VIOLATION     = 1,
> > > > > +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic
> > > > > access
> > > > > violation */
> > > > 
> > > > XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION ->
> > > > XE_PAGEFAULT_ACCESS_TYPE_ATOMIC
> > > > 
> > > 
> > > The intended prefix here is 'XE_PAGEFAULT_TYPE_' to normalize the
> > > naming
> > > with 'enum xe_pagefault_type'.
> > 
> > Ah sorry you're right. I also should have been more specific that I
> > meant this should be ATOMIC access vs WRITE access, so:
> > XE_PAGEFAULT_TYPE_ATOMIC_ACCESS_VIOLATION
> > 
> > > 
> > > > > +       XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION    = 2,
> > > > > +};
> > > > > +
> > > > > +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> > > > > +struct xe_pagefault_ops {
> > > > > +       /**
> > > > > +        * @ack_fault: Ack fault
> > > > > +        * @pf: Page fault
> > > > > +        * @err: Error state of fault
> > > > > +        *
> > > > > +        * Page fault producer receives acknowledgment from
> > > > > the
> > > > > consumer and
> > > > > +        * sends the result to the HW/FW interface.
> > > > > +        */
> > > > > +       void (*ack_fault)(struct xe_pagefault *pf, int err);
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * struct xe_pagefault - Xe page fault
> > > > > + *
> > > > > + * Generic page fault structure for communication between
> > > > > producer
> > > > > and consumer.
> > > > > + * Carefully sized to be 64 bytes.
> > > > > + */
> > > > > +struct xe_pagefault {
> > > > > +       /**
> > > > > +        * @gt: GT of fault
> > > > > +        *
> > > > > +        * XXX: We may want to decouple the GT from
> > > > > individual
> > > > > faults, as it's
> > > > > +        * unclear whether future platforms will always have
> > > > > a GT
> > > > > for
> > > > > all page
> > > > > +        * fault producers. Internally, the GT is used for
> > > > > stats,
> > > > > identifying
> > > > > +        * the appropriate VRAM region, and locating the
> > > > > migration
> > > > > queue.
> > > > > +        * Leaving this as-is for now, but we can revisit
> > > > > later
> > > > > to
> > > > > see if we
> > > > > +        * can convert it to use the Xe device pointer
> > > > > instead.
> > > > > +        */
> > > > 
> > > > What if, instead of assuming the GT stays static and eventually
> > > > removing it if we have some new HW abstraction layer that isn't
> > > > a GT but still uses the page fault, we instead push to have said
> > > > theoretical abstraction layer overload the GT here, like we're
> > > > doing with primary and media today? Then we can keep the
> > > > interface here simple and just leave this in there, or change it
> > > > in the future if that doesn't make sense, without the suggestive
> > > > comment?
> > > > 
> > > 
> > > I can remove this comment, as it adds some confusion. Hopefully,
> > > we
> > > always have a GT. I was just speculating about future cases where
> > > we
> > > might not have one. From a purely interface perspective, it would
> > > be
> > > ideal to completely decouple the GT here.
> > >  
> > > > > +       struct xe_gt *gt;
> > > > > +       /**
> > > > > +        * @consumer: State for the software handling the
> > > > > fault.
> > > > > Populated by
> > > > > +        * the producer and may be modified by the consumer
> > > > > to
> > > > > communicate
> > > > > +        * information back to the producer upon fault
> > > > > acknowledgment.
> > > > > +        */
> > > > > +       struct {
> > > > > +               /** @consumer.page_addr: address of page
> > > > > fault */
> > > > > +               u64 page_addr;
> > > > > +               /** @consumer.asid: address space ID */
> > > > > +               u32 asid;
> > > > 
> > > > Can we just call this an ID instead of a pasid or asid? I.e.
> > > > the ID
> > > > could be anything, not strictly process-bound.
> > > > 
> > > 
> > > I think the idea here is that this serves as the ID for our
> > > reverse
> > > VM
> > > lookup mechanism in the KMD. We call it ASID throughout the
> > > codebase
> > > today, so we’re stuck with the name—though it may or may not have
> > > any
> > > actual meaning in hardware, depending on the producer. For
> > > example,
> > > if
> > > the producer receives a fault based on a queue ID, we’d look up
> > > the
> > > queue and then pass in q->vm.asid.
> > > 
> > > We could even have the producer look up the VM directly, if
> > > preferred,
> > > and just pass that over. However, that would require a few more
> > > bits
> > > here and might introduce lifetime issues—for example, we’d have
> > > to
> > > refcount the VM.
> > 
> > Yeah I mean some of those problems we can solve if they come up
> > later.
> > Just thinking having something more generic here would be nice. But
> > I
> > agree on the cross-KMD usage. We can keep this and change it more
> > broadly if that makes sense later.
> > 
> 
> I think the point here is we are always going to have a VM, which is
> required by the consumer to service the fault; the producer side needs
> to parse the fault, figure out a known value in the KMD which
> corresponds to a VM, and pass it over. We call this value asid today
> (also the name of the hardware interface + what we program into the
> LRC) but could rename this everywhere in the KMD if that makes sense,
> e.g., kmd_vm_id (vm_id is a user space name / value which means
> something different).

Ok for now no problem. I agree this is consistent in the driver anyway
so better to make a more holistic change if we do that. I do like the
idea of having this more generic to allow all types of fault
originators. We can look later though.
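
For reference, my understanding is that the reverse lookup being
described is roughly the below - a sketch with assumed names, modeled on
the existing asid_to_vm() helper, not code from this series:

/* Sketch only: consumer-side reverse lookup from asid to VM. */
static struct xe_vm *xe_pagefault_asid_to_vm(struct xe_device *xe, u32 asid)
{
        struct xe_vm *vm;

        down_read(&xe->usm.lock);
        vm = xa_load(&xe->usm.asid_to_vm, asid);
        if (vm && xe_vm_in_fault_mode(vm))
                xe_vm_get(vm);
        else
                vm = ERR_PTR(-EINVAL);
        up_read(&xe->usm.lock);

        return vm;
}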

> 
> > > 
> > > > > +               /** @consumer.access_type: access type */
> > > > > +               u8 access_type;
> > > > > +               /** @consumer.fault_type: fault type */
> > > > > +               u8 fault_type;
> > > > > +#define XE_PAGEFAULT_LEVEL_NACK                0xff    /*
> > > > > Producer
> > > > > indicates nack fault */
> > > > > +               /** @consumer.fault_level: fault level */
> > > > > +               u8 fault_level;
> > > > > +               /** @consumer.engine_class: engine class */
> > > > > +               u8 engine_class;
> > > > > +               /** consumer.reserved: reserved bits for
> > > > > future
> > > > > expansion */
> > > > > +               u64 reserved;
> > > > 
> > > > What about engine instance? Or is that going to overload
> > > > reserved
> > > > here?
> > > > 
> > > 
> > > reserved could be used to include 'engine instance' if required;
> > > it is there for future expansion and also to keep the structure
> > > sized to 64 bytes.
> > > 
> > > I include fault_level, engine_class as I thought both were used by
> > > [1], but now that I looked again only fault level is used, so I
> > > guess engine_class can be pulled out too unless we want to keep it
> > > for the only place in which it is used (debug messages).
> > 
> > I think today hardware/GuC provides both engine class and engine
> > instance, which is why I mentioned it. We can ignore those fields if
> > we don't feel they are valuable/relevant, but at least today we are
> > reading those and printing them out.
> > 
> 
> Yes, the debug message drops engine instance due to not passing this
> value over. I think that is ok; engine class is typically all we care
> about anyway.

I'd actually disagree here. I think it is valuable to print out all of
the fields we receive from hardware for debug purposes - i.e. if we get
a fault to a non-existent engine instance or to the wrong one, it could
help us narrow down issues in that area. At the very least we should
have a way to print and decode these even if we only actually store the
class.
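
Something along these lines in the servicing path would cover it - a
sketch only, where consumer.engine_instance is a hypothetical field that
is not carried in struct xe_pagefault as posted:

/*
 * Sketch only: dump everything the producer handed over, including the
 * engine instance, even if only the class is used to service the fault.
 * pf->consumer.engine_instance is hypothetical, not in the posted patch.
 */
drm_dbg(&xe->drm,
        "Page fault: asid=%u addr=0x%016llx type=%u class=%u instance=%u",
        pf->consumer.asid, pf->consumer.page_addr,
        pf->consumer.fault_type, pf->consumer.engine_class,
        pf->consumer.engine_instance);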

Thanks,
Stuart

> 
> Matt
> 
> > Thanks,
> > Stuart
> > 
> > > 
> > > Matt
> > > 
> > > [1] https://patchwork.freedesktop.org/series/148727/
> > > 
> > > > Thanks,
> > > > Stuart
> > > > 
> > > > > +       } consumer;
> > > > > +       /**
> > > > > +        * @producer: State for the producer (i.e., HW/FW
> > > > > interface).
> > > > > Populated
> > > > > +        * by the producer and should not be modified—or even
> > > > > inspected—by the
> > > > > +        * consumer, except for calling operations.
> > > > > +        */
> > > > > +       struct {
> > > > > +               /** @producer.private: private pointer */
> > > > > +               void *private;
> > > > > +               /** @producer.ops: operations */
> > > > > +               const struct xe_pagefault_ops *ops;
> > > > > +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW       4
> > > > > +               /**
> > > > > +                * producer.msg: page fault message, used by
> > > > > producer
> > > > > in fault
> > > > > +                * acknowledgement to formulate response to
> > > > > HW/FW
> > > > > interface.
> > > > > +                */
> > > > > +               u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
> > > > > +       } producer;
> > > > > +};
> > > > > +
> > > > > +/** struct xe_pagefault_queue: Xe pagefault queue (consumer)
> > > > > */
> > > > > +struct xe_pagefault_queue {
> > > > > +       /**
> > > > > +        * @data: Data in queue containing struct
> > > > > xe_pagefault,
> > > > > protected by
> > > > > +        * @lock
> > > > > +        */
> > > > > +       void *data;
> > > > > +       /** @size: Size of queue in bytes */
> > > > > +       u32 size;
> > > > > +       /** @head: Head pointer in bytes, moved by producer,
> > > > > protected by @lock */
> > > > > +       u32 head;
> > > > > +       /** @tail: Tail pointer in bytes, moved by consumer,
> > > > > protected by @lock */
> > > > > +       u32 tail;
> > > > > +       /** @lock: protects page fault queue */
> > > > > +       spinlock_t lock;
> > > > > +       /** @worker: to process page faults */
> > > > > +       struct work_struct worker;
> > > > > +};
> > > > > +
> > > > > +#endif
> > > > 
> > 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 02/11] drm/xe: Implement xe_pagefault_init
  2025-08-28 20:14     ` Matthew Brost
@ 2025-08-28 20:19       ` Summers, Stuart
  0 siblings, 0 replies; 51+ messages in thread
From: Summers, Stuart @ 2025-08-28 20:19 UTC (permalink / raw)
  To: Brost, Matthew
  Cc: intel-xe@lists.freedesktop.org, Dugast, Francois,
	Ghimiray, Himal Prasad, Mrozek, Michal,
	thomas.hellstrom@linux.intel.com

On Thu, 2025-08-28 at 13:14 -0700, Matthew Brost wrote:
> On Thu, Aug 28, 2025 at 02:10:02PM -0600, Summers, Stuart wrote:
> > On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > > Create pagefault queues and initialize them.
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_device.c       |  5 ++
> > >  drivers/gpu/drm/xe/xe_device_types.h |  6 ++
> > >  drivers/gpu/drm/xe/xe_pagefault.c    | 93
> > > +++++++++++++++++++++++++++-
> > >  3 files changed, 102 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_device.c
> > > b/drivers/gpu/drm/xe/xe_device.c
> > > index 57edbc63da6f..c7c8aee03841 100644
> > > --- a/drivers/gpu/drm/xe/xe_device.c
> > > +++ b/drivers/gpu/drm/xe/xe_device.c
> > > @@ -50,6 +50,7 @@
> > >  #include "xe_nvm.h"
> > >  #include "xe_oa.h"
> > >  #include "xe_observation.h"
> > > +#include "xe_pagefault.h"
> > >  #include "xe_pat.h"
> > >  #include "xe_pcode.h"
> > >  #include "xe_pm.h"
> > > @@ -890,6 +891,10 @@ int xe_device_probe(struct xe_device *xe)
> > >         if (err)
> > >                 return err;
> > >  
> > > +       err = xe_pagefault_init(xe);
> > > +       if (err)
> > > +               return err;
> > > +
> > >         xe_nvm_init(xe);
> > >  
> > >         err = xe_heci_gsc_init(xe);
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > index 01e8fa0d2f9f..6aa119026ce9 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -17,6 +17,7 @@
> > >  #include "xe_lmtt_types.h"
> > >  #include "xe_memirq_types.h"
> > >  #include "xe_oa_types.h"
> > > +#include "xe_pagefault_types.h"
> > >  #include "xe_platform_types.h"
> > >  #include "xe_pmu_types.h"
> > >  #include "xe_pt_types.h"
> > > @@ -394,6 +395,11 @@ struct xe_device {
> > >                 u32 next_asid;
> > >                 /** @usm.lock: protects UM state */
> > >                 struct rw_semaphore lock;
> > > +               /** @usm.pf_wq: page fault work queue, unbound,
> > > high
> > > priority */
> > > +               struct workqueue_struct *pf_wq;
> > > +#define XE_PAGEFAULT_QUEUE_COUNT       4
> > > +               /** @pf_queue: Page fault queues */
> > > +               struct xe_pagefault_queue
> > > pf_queue[XE_PAGEFAULT_QUEUE_COUNT];
> > >         } usm;
> > >  
> > >         /** @pinned: pinned BO state */
> > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > > b/drivers/gpu/drm/xe/xe_pagefault.c
> > > index 3ce0e8d74b9d..14304c41eb23 100644
> > > --- a/drivers/gpu/drm/xe/xe_pagefault.c
> > > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > > @@ -3,6 +3,10 @@
> > >   * Copyright © 2025 Intel Corporation
> > >   */
> > >  
> > > +#include <drm/drm_managed.h>
> > > +
> > > +#include "xe_device.h"
> > > +#include "xe_gt_types.h"
> > >  #include "xe_pagefault.h"
> > >  #include "xe_pagefault_types.h"
> > >  
> > > @@ -19,6 +23,71 @@
> > >   * with a single shared consumer.
> > >   */
> > >  
> > > +static int xe_pagefault_entry_size(void)
> > > +{
> > > +       return roundup_pow_of_two(sizeof(struct xe_pagefault));
> > 
> > And here, it would be nice if you could add a brief comment that
> > this
> > assumes the size of struct xe_pagefault aligns to the hardware
> > requirements.
> > 
> 
> It is actually not a hardware thing; it's the pagefault queue
> management code (software) where the logic breaks if we are not doing
> everything on pow2 boundaries. Of course, this isn't a strict
> requirement; rather, it just makes the code simpler.
> 
> I can add a comment around this.

Ok got it. Yeah that would be helpful here I agree.

Thanks,
Stuart

> 
> Matt
> 
> > Thanks,
> > Stuart
> > 
> > > +}
> > > +
> > > +static void xe_pagefault_queue_work(struct work_struct *w)
> > > +{
> > > +       /* TODO: Implement */
> > > +}
> > > +
> > > +static int xe_pagefault_queue_init(struct xe_device *xe,
> > > +                                  struct xe_pagefault_queue
> > > *pf_queue)
> > > +{
> > > +       struct xe_gt *gt;
> > > +       int total_num_eus = 0;
> > > +       u8 id;
> > > +
> > > +       for_each_gt(gt, xe, id) {
> > > +               xe_dss_mask_t all_dss;
> > > +               int num_dss, num_eus;
> > > +
> > > +               bitmap_or(all_dss, gt->fuse_topo.g_dss_mask,
> > > +                         gt->fuse_topo.c_dss_mask,
> > > XE_MAX_DSS_FUSE_BITS);
> > > +
> > > +               num_dss = bitmap_weight(all_dss,
> > > XE_MAX_DSS_FUSE_BITS);
> > > +               num_eus = bitmap_weight(gt-
> > > > fuse_topo.eu_mask_per_dss,
> > > +                                       XE_MAX_EU_FUSE_BITS) *
> > > num_dss;
> > > +
> > > +               total_num_eus += num_eus;
> > > +       }
> > > +
> > > +       xe_assert(xe, total_num_eus);
> > > +
> > > +       /*
> > > +        * user can issue separate page faults per EU and per CS
> > > +        *
> > > +        * XXX: Multiplier required as compute UMD are getting PF
> > > queue errors
> > > +        * without it. Follow on why this multiplier is required.
> > > +        */
> > > +#define PF_MULTIPLIER  8
> > > +       pf_queue->size = (total_num_eus + XE_NUM_HW_ENGINES) *
> > > +               xe_pagefault_entry_size() * PF_MULTIPLIER;
> > > +       pf_queue->size = roundup_pow_of_two(pf_queue->size);
> > > +#undef PF_MULTIPLIER
> > > +
> > > +       drm_dbg(&xe->drm, "xe_pagefault_entry_size=%d,
> > > total_num_eus=%d, pf_queue->size=%u",
> > > +               xe_pagefault_entry_size(), total_num_eus,
> > > pf_queue-
> > > > size);
> > > +
> > > +       pf_queue->data = devm_kzalloc(xe->drm.dev, pf_queue-
> > > >size,
> > > GFP_KERNEL);
> > > +       if (!pf_queue->data)
> > > +               return -ENOMEM;
> > > +
> > > +       spin_lock_init(&pf_queue->lock);
> > > +       INIT_WORK(&pf_queue->worker, xe_pagefault_queue_work);
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static void xe_pagefault_fini(void *arg)
> > > +{
> > > +       struct xe_device *xe = arg;
> > > +
> > > +       destroy_workqueue(xe->usm.pf_wq);
> > > +}
> > > +
> > >  /**
> > >   * xe_pagefault_init() - Page fault init
> > >   * @xe: xe device instance
> > > @@ -29,8 +98,28 @@
> > >   */
> > >  int xe_pagefault_init(struct xe_device *xe)
> > >  {
> > > -       /* TODO - implement */
> > > -       return 0;
> > > +       int err, i;
> > > +
> > > +       if (!xe->info.has_usm)
> > > +               return 0;
> > > +
> > > +       xe->usm.pf_wq =
> > > alloc_workqueue("xe_page_fault_work_queue",
> > > +                                       WQ_UNBOUND | WQ_HIGHPRI,
> > > +                                       XE_PAGEFAULT_QUEUE_COUNT)
> > > ;
> > > +       if (!xe->usm.pf_wq)
> > > +               return -ENOMEM;
> > > +
> > > +       for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i) {
> > > +               err = xe_pagefault_queue_init(xe, xe-
> > > >usm.pf_queue +
> > > i);
> > > +               if (err)
> > > +                       goto err_out;
> > > +       }
> > > +
> > > +       return devm_add_action_or_reset(xe->drm.dev,
> > > xe_pagefault_fini, xe);
> > > +
> > > +err_out:
> > > +       destroy_workqueue(xe->usm.pf_wq);
> > > +       return err;
> > >  }
> > >  
> > >  /**
> > 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 01/11] drm/xe: Stub out new pagefault layer
  2025-08-28 20:18           ` Summers, Stuart
@ 2025-08-28 20:20             ` Matthew Brost
  0 siblings, 0 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-28 20:20 UTC (permalink / raw)
  To: Summers, Stuart
  Cc: intel-xe@lists.freedesktop.org, thomas.hellstrom@linux.intel.com,
	Ghimiray, Himal Prasad, Dugast, Francois, Mrozek, Michal

On Thu, Aug 28, 2025 at 02:18:30PM -0600, Summers, Stuart wrote:
> On Thu, 2025-08-07 at 11:10 -0700, Matthew Brost wrote:
> > On Thu, Aug 07, 2025 at 11:20:06AM -0600, Summers, Stuart wrote:
> > > On Wed, 2025-08-06 at 16:53 -0700, Matthew Brost wrote:
> > > > On Wed, Aug 06, 2025 at 05:01:12PM -0600, Summers, Stuart wrote:
> > > > > Few basic comments below to start. I personally would rather
> > > > > this
> > > > > be
> > > > > brought over from the existing fault handler rather than
> > > > > creating
> > > > > something entirely new and then clobbering the older stuff -
> > > > > just so the message format requests/replies are easier to
> > > > > review, along with where we're deviating from the existing
> > > > > external interfaces
> > > > > (HW/FW/GuC/etc). You already have this here though so not a
> > > > > huge
> > > > > deal.
> > > > > I think most of this was in the giant blob of patches that got
> > > > > merged
> > > > > with the initial driver, so I guess the counter argument is we
> > > > > can
> > > > > have
> > > > > easy to reference historical reviews now.
> > > > > 
> > > > 
> > > > Yes, page fault code is largely just a big blob from the original
> > > > Xe
> > > > patch that wasn't the most well thought out code. We still have
> > > > that
> > > > history in the tree, just git blame won't work, so you'd need to
> > > > know
> > > > where to look if you want that.
> > > > 
> > > > I don't think there is a great way to pull this over, unless
> > > > patches
> > > > 2-7
> > > > are squashed into a single patch + a couple of 'git mv' are used.
> > > 
> > > No definitely don't think that's worth it. Let's just review as-is.
> > > 
> > > > 
> > > > > On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > > > > > Stub out the new page fault layer and add kernel
> > > > > > documentation.
> > > > > > This
> > > > > > is
> > > > > > intended as a replacement for the GT page fault layer,
> > > > > > enabling
> > > > > > multiple
> > > > > > producers to hook into a shared page fault consumer
> > > > > > interface.
> > > > > > 
> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > ---
> > > > > >  drivers/gpu/drm/xe/Makefile             |   1 +
> > > > > >  drivers/gpu/drm/xe/xe_pagefault.c       |  63 ++++++++++++
> > > > > >  drivers/gpu/drm/xe/xe_pagefault.h       |  19 ++++
> > > > > >  drivers/gpu/drm/xe/xe_pagefault_types.h | 125
> > > > > > ++++++++++++++++++++++++
> > > > > >  4 files changed, 208 insertions(+)
> > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.c
> > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault.h
> > > > > >  create mode 100644 drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/xe/Makefile
> > > > > > b/drivers/gpu/drm/xe/Makefile
> > > > > > index 8e0c3412a757..6fbebafe79c9 100644
> > > > > > --- a/drivers/gpu/drm/xe/Makefile
> > > > > > +++ b/drivers/gpu/drm/xe/Makefile
> > > > > > @@ -93,6 +93,7 @@ xe-y += xe_bb.o \
> > > > > >         xe_nvm.o \
> > > > > >         xe_oa.o \
> > > > > >         xe_observation.o \
> > > > > > +       xe_pagefault.o \
> > > > > >         xe_pat.o \
> > > > > >         xe_pci.o \
> > > > > >         xe_pcode.o \
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > > > > > b/drivers/gpu/drm/xe/xe_pagefault.c
> > > > > > new file mode 100644
> > > > > > index 000000000000..3ce0e8d74b9d
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > > > > > @@ -0,0 +1,63 @@
> > > > > > +// SPDX-License-Identifier: MIT
> > > > > > +/*
> > > > > > + * Copyright © 2025 Intel Corporation
> > > > > > + */
> > > > > > +
> > > > > > +#include "xe_pagefault.h"
> > > > > > +#include "xe_pagefault_types.h"
> > > > > > +
> > > > > > +/**
> > > > > > + * DOC: Xe page faults
> > > > > > + *
> > > > > > + * Xe page faults are handled in two layers. The producer
> > > > > > layer
> > > > > > interacts with
> > > > > > + * hardware or firmware to receive and parse faults into
> > > > > > struct
> > > > > > xe_pagefault,
> > > > > > + * then forwards them to the consumer. The consumer layer
> > > > > > services
> > > > > > the faults
> > > > > > + * (e.g., memory migration, page table updates) and
> > > > > > acknowledges
> > > > > > the
> > > > > > result back
> > > > > > + * to the producer, which then forwards the results to the
> > > > > > hardware
> > > > > > or firmware.
> > > > > > + * The consumer uses a page fault queue sized to absorb all
> > > > > > potential faults and
> > > > > > + * a multi-threaded worker to process them. Multiple
> > > > > > producers
> > > > > > are
> > > > > > supported,
> > > > > > + * with a single shared consumer.
> > > > > > + */
> > > > > > +
> > > > > > +/**
> > > > > > + * xe_pagefault_init() - Page fault init
> > > > > > + * @xe: xe device instance
> > > > > > + *
> > > > > > + * Initialize Xe page fault state. Must be done after
> > > > > > reading
> > > > > > fuses.
> > > > > > + *
> > > > > > + * Return: 0 on Success, errno on failure
> > > > > > + */
> > > > > > +int xe_pagefault_init(struct xe_device *xe)
> > > > > > +{
> > > > > > +       /* TODO - implement */
> > > > > > +       return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * xe_pagefault_reset() - Page fault reset for a GT
> > > > > > + * @xe: xe device instance
> > > > > > + * @gt: GT being reset
> > > > > > + *
> > > > > > + * Reset the Xe page fault state for a GT; that is, squash
> > > > > > any
> > > > > > pending faults on
> > > > > > + * the GT.
> > > > > > + */
> > > > > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt
> > > > > > *gt)
> > > > > > +{
> > > > > > +       /* TODO - implement */
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * xe_pagefault_handler() - Page fault handler
> > > > > > + * @xe: xe device instance
> > > > > > + * @pf: Page fault
> > > > > > + *
> > > > > > + * Sink the page fault to a queue (i.e., a memory buffer)
> > > > > > and
> > > > > > queue
> > > > > > a worker to
> > > > > > + * service it. Safe to be called from IRQ or process
> > > > > > context.
> > > > > > Reclaim safe.
> > > > > > + *
> > > > > > + * Return: 0 on success, errno on failure
> > > > > > + */
> > > > > > +int xe_pagefault_handler(struct xe_device *xe, struct
> > > > > > xe_pagefault
> > > > > > *pf)
> > > > > > +{
> > > > > > +       /* TODO - implement */
> > > > > > +       return 0;
> > > > > > +}
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault.h
> > > > > > b/drivers/gpu/drm/xe/xe_pagefault.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..bd0cdf9ed37f
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> > > > > > @@ -0,0 +1,19 @@
> > > > > > +/* SPDX-License-Identifier: MIT */
> > > > > > +/*
> > > > > > + * Copyright © 2025 Intel Corporation
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef _XE_PAGEFAULT_H_
> > > > > > +#define _XE_PAGEFAULT_H_
> > > > > > +
> > > > > > +struct xe_device;
> > > > > > +struct xe_gt;
> > > > > > +struct xe_pagefault;
> > > > > > +
> > > > > > +int xe_pagefault_init(struct xe_device *xe);
> > > > > > +
> > > > > > +void xe_pagefault_reset(struct xe_device *xe, struct xe_gt
> > > > > > *gt);
> > > > > > +
> > > > > > +int xe_pagefault_handler(struct xe_device *xe, struct
> > > > > > xe_pagefault
> > > > > > *pf);
> > > > > > +
> > > > > > +#endif
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > > > b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..fcff84f93dd8
> > > > > > --- /dev/null
> > > > > > +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> > > > > > @@ -0,0 +1,125 @@
> > > > > > +/* SPDX-License-Identifier: MIT */
> > > > > > +/*
> > > > > > + * Copyright © 2025 Intel Corporation
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef _XE_PAGEFAULT_TYPES_H_
> > > > > > +#define _XE_PAGEFAULT_TYPES_H_
> > > > > > +
> > > > > > +#include <linux/workqueue.h>
> > > > > > +
> > > > > > +struct xe_pagefault;
> > > > > > +struct xe_gt;
> > > > > 
> > > > > Nit: Maybe reverse these structs to be in alphabetical order
> > > > > 
> > > > 
> > > > Yes, that is the preferred style. Will fix.
> > > > 
> > > > > > +
> > > > > > +/** enum xe_pagefault_access_type - Xe page fault access
> > > > > > type */
> > > > > > +enum xe_pagefault_access_type {
> > > > > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_READ: Read access type
> > > > > > */
> > > > > > +       XE_PAGEFAULT_ACCESS_TYPE_READ   = 0,
> > > > > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_WRITE: Write access
> > > > > > type */
> > > > > > +       XE_PAGEFAULT_ACCESS_TYPE_WRITE  = 1,
> > > > > > +       /** @XE_PAGEFAULT_ACCESS_TYPE_ATOMIC: Atomic access
> > > > > > type
> > > > > > */
> > > > > > +       XE_PAGEFAULT_ACCESS_TYPE_ATOMIC = 2,
> > > > > > +};
> > > > > > +
> > > > > > +/** enum xe_pagefault_type - Xe page fault type */
> > > > > > +enum xe_pagefault_type {
> > > > > > +       /** @XE_PAGEFAULT_TYPE_NOT_PRESENT: Not present */
> > > > > > +       XE_PAGEFAULT_TYPE_NOT_PRESENT           = 0,
> > > > > > +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Write
> > > > > > access
> > > > > > violation */
> > > > > > +       XE_PAGEFAULT_WRITE_ACCESS_VIOLATION     = 1,
> > > > > > +       /** @XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION: Atomic
> > > > > > access
> > > > > > violation */
> > > > > 
> > > > > XE_PAGEFAULT_TYPE_WRITE_ACCESS_VIOLATION ->
> > > > > XE_PAGEFAULT_ACCESS_TYPE_ATOMIC
> > > > > 
> > > > 
> > > > The intended prefix here is 'XE_PAGEFAULT_TYPE_' to normalize the
> > > > naming
> > > > with 'enum xe_pagefault_type'.
> > > 
> > > Ah sorry you're right. I also should have been more specific that I
> > > meant this should be ATOMIC access vs WRITE access, so:
> > > XE_PAGEFAULT_TYPE_ATOMIC_ACCESS_VIOLATION
> > > 
> > > > 
> > > > > > +       XE_PAGEFAULT_ATOMIC_ACCESS_VIOLATION    = 2,
> > > > > > +};
> > > > > > +
> > > > > > +/** struct xe_pagefault_ops - Xe pagefault ops (producer) */
> > > > > > +struct xe_pagefault_ops {
> > > > > > +       /**
> > > > > > +        * @ack_fault: Ack fault
> > > > > > +        * @pf: Page fault
> > > > > > +        * @err: Error state of fault
> > > > > > +        *
> > > > > > +        * Page fault producer receives acknowledgment from
> > > > > > the
> > > > > > consumer and
> > > > > > +        * sends the result to the HW/FW interface.
> > > > > > +        */
> > > > > > +       void (*ack_fault)(struct xe_pagefault *pf, int err);
> > > > > > +};
> > > > > > +
> > > > > > +/**
> > > > > > + * struct xe_pagefault - Xe page fault
> > > > > > + *
> > > > > > + * Generic page fault structure for communication between
> > > > > > producer
> > > > > > and consumer.
> > > > > > + * Carefully sized to be 64 bytes.
> > > > > > + */
> > > > > > +struct xe_pagefault {
> > > > > > +       /**
> > > > > > +        * @gt: GT of fault
> > > > > > +        *
> > > > > > +        * XXX: We may want to decouple the GT from
> > > > > > individual
> > > > > > faults, as it's
> > > > > > +        * unclear whether future platforms will always have
> > > > > > a GT
> > > > > > for
> > > > > > all page
> > > > > > +        * fault producers. Internally, the GT is used for
> > > > > > stats,
> > > > > > identifying
> > > > > > +        * the appropriate VRAM region, and locating the
> > > > > > migration
> > > > > > queue.
> > > > > > +        * Leaving this as-is for now, but we can revisit
> > > > > > later
> > > > > > to
> > > > > > see if we
> > > > > > +        * can convert it to use the Xe device pointer
> > > > > > instead.
> > > > > > +        */
> > > > > 
> > > > > What if instead of assuming the GT stays static and we
> > > > > eventually
> > > > > remove it if we have some new HW abstraction layer that isn't a GT
> > > > > but
> > > > > still uses the page fault, we instead push to have said
> > > > > theoretical
> > > > > abstraction layer overload the GT here like we're doing with
> > > > > primary
> > > > > and media today. Then we can keep the interface here simple and
> > > > > just
> > > > > leave this in there, or change in the future if that doesn't
> > > > > make
> > > > > sense
> > > > > without the suggestive comment?
> > > > > 
> > > > 
> > > > I can remove this comment, as it adds some confusion. Hopefully,
> > > > we
> > > > always have a GT. I was just speculating about future cases where
> > > > we
> > > > might not have one. From a purely interface perspective, it would
> > > > be
> > > > ideal to completely decouple the GT here.
> > > >  
> > > > > > +       struct xe_gt *gt;
> > > > > > +       /**
> > > > > > +        * @consumer: State for the software handling the
> > > > > > fault.
> > > > > > Populated by
> > > > > > +        * the producer and may be modified by the consumer
> > > > > > to
> > > > > > communicate
> > > > > > +        * information back to the producer upon fault
> > > > > > acknowledgment.
> > > > > > +        */
> > > > > > +       struct {
> > > > > > +               /** @consumer.page_addr: address of page
> > > > > > fault */
> > > > > > +               u64 page_addr;
> > > > > > +               /** @consumer.asid: address space ID */
> > > > > > +               u32 asid;
> > > > > 
> > > > > Can we just call this an ID instead of a pasid or asid? I.e.
> > > > > the ID
> > > > > could be anything, not strictly process-bound.
> > > > > 
> > > > 
> > > > I think the idea here is that this serves as the ID for our
> > > > reverse
> > > > VM
> > > > lookup mechanism in the KMD. We call it ASID throughout the
> > > > codebase
> > > > today, so we’re stuck with the name—though it may or may not have
> > > > any
> > > > actual meaning in hardware, depending on the producer. For
> > > > example,
> > > > if
> > > > the producer receives a fault based on a queue ID, we’d look up
> > > > the
> > > > queue and then pass in q->vm.asid.
> > > > 
> > > > We could even have the producer look up the VM directly, if
> > > > preferred,
> > > > and just pass that over. However, that would require a few more
> > > > bits
> > > > here and might introduce lifetime issues—for example, we’d have
> > > > to
> > > > refcount the VM.
> > > 
> > > Yeah I mean some of those problems we can solve if they come up
> > > later.
> > > Just thinking having something more generic here would be nice. But
> > > I
> > > agree on the cross-KMD usage. We can keep this and change it more
> > > broadly if that makes sense later.
> > > 
> > 
> > I think the point here is we are always going to have a VM which is
> > required by the consumer to service the fault, the producer side
> > needs to
> > parse the fault and figure out a known value in the KMD which
> > corresponds to a VM and pass it over. We call this value asid today
> > (also the name of the hardware interface + what we program into the LRC)
> > but
> > could rename this everywhere in KMD if that makes sense. e.g.,
> > kmd_vm_id (vm_id is a user space name / value which means something
> > different).
> 
> Ok for now no problem. I agree this is consistent in the driver anyway
> so better to make a more holistic change if we do that. I do like the
> idea of having this more generic to allow all types of fault
> originators. We can look later though.
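
To make the asid hand-off concrete, a tiny hypothetical sketch (the exec queue
lookup, ops table, and helper names are made up for illustration and are not
part of this series): a producer that receives faults keyed by an exec queue
resolves that key to the VM's asid and passes the fault to the shared
consumer.

	/* Hypothetical producer; only xe_pagefault_handler() is from the series. */
	static void example_ack_fault(struct xe_pagefault *pf, int err)
	{
		/* report the result back to the faulting source here */
	}

	static const struct xe_pagefault_ops example_producer_ops = {
		.ack_fault = example_ack_fault,
	};

	static int example_fault_from_queue(struct xe_device *xe, struct xe_gt *gt,
					    struct xe_exec_queue *q, u64 page_addr)
	{
		struct xe_pagefault pf = {};

		pf.gt = gt;
		pf.consumer.page_addr = page_addr;
		pf.consumer.asid = q->vm->usm.asid;	/* KMD-wide reverse-lookup key */
		pf.consumer.access_type = XE_PAGEFAULT_ACCESS_TYPE_WRITE;
		pf.consumer.fault_type = XE_PAGEFAULT_TYPE_NOT_PRESENT;
		pf.producer.ops = &example_producer_ops;	/* supplies ack_fault() */

		return xe_pagefault_handler(xe, &pf);
	}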
> 
> > 
> > > > 
> > > > > > +               /** @consumer.access_type: access type */
> > > > > > +               u8 access_type;
> > > > > > +               /** @consumer.fault_type: fault type */
> > > > > > +               u8 fault_type;
> > > > > > +#define XE_PAGEFAULT_LEVEL_NACK                0xff    /*
> > > > > > Producer
> > > > > > indicates nack fault */
> > > > > > +               /** @consumer.fault_level: fault level */
> > > > > > +               u8 fault_level;
> > > > > > +               /** @consumer.engine_class: engine class */
> > > > > > +               u8 engine_class;
> > > > > > +               /** consumer.reserved: reserved bits for
> > > > > > future
> > > > > > expansion */
> > > > > > +               u64 reserved;
> > > > > 
> > > > > What about engine instance? Or is that going to overload
> > > > > reserved
> > > > > here?
> > > > > 
> > > > 
> > > > reserved could be used to include 'engine instance' if required;
> > > > it is there for future expansion and also to keep the structure
> > > > sized to 64 bytes.
> > > > 
> > > > I included fault_level and engine_class as I thought both were used
> > > > by [1], but now that I look again only fault level is used, so I
> > > > guess engine_class can be pulled out too unless we want to keep it
> > > > for the only place in which it is used (debug messages).
> > > 
> > > I think today hardware/GuC provides both engine class and engine
> > > instance, which is why I mentioned it. We can ignore those fields if we
> > > don't feel they are valuable/relevant, but at least today we are
> > > reading those and printing them out.
> > > 
> > 
> > Yes, the debug message drops engine instance due to not passing this
> > value over. I think that is ok; engine class is typically all we care
> > about anyway.
> 
> I'd actually disagree here. I think it is valuable to print out all of
> the fields we receive from hardware for debug purposes - i.e. if we get
> a fault to a non-existent engine instance or to the wrong one, it could
> help us narrow down issues in that area. At the very least we should
> have a way to print and decode these even if we only actually store the
> class.
> 

Ok, since we have the extra bits, I'll bring this back in and ensure we
retain all of the information in the current debug message.

Matt 

> Thanks,
> Stuart
> 
> > 
> > Matt
> > 
> > > Thanks,
> > > Stuart
> > > 
> > > > 
> > > > Matt
> > > > 
> > > > [1] https://patchwork.freedesktop.org/series/148727/
> > > > 
> > > > > Thanks,
> > > > > Stuart
> > > > > 
> > > > > > +       } consumer;
> > > > > > +       /**
> > > > > > +        * @producer: State for the producer (i.e., HW/FW
> > > > > > interface).
> > > > > > Populated
> > > > > > +        * by the producer and should not be modified—or even
> > > > > > inspected—by the
> > > > > > +        * consumer, except for calling operations.
> > > > > > +        */
> > > > > > +       struct {
> > > > > > +               /** @producer.private: private pointer */
> > > > > > +               void *private;
> > > > > > +               /** @producer.ops: operations */
> > > > > > +               const struct xe_pagefault_ops *ops;
> > > > > > +#define XE_PAGEFAULT_PRODUCER_MSG_LEN_DW       4
> > > > > > +               /**
> > > > > > +                * producer.msg: page fault message, used by
> > > > > > producer
> > > > > > in fault
> > > > > > +                * acknowledgement to formulate response to
> > > > > > HW/FW
> > > > > > interface.
> > > > > > +                */
> > > > > > +               u32 msg[XE_PAGEFAULT_PRODUCER_MSG_LEN_DW];
> > > > > > +       } producer;
> > > > > > +};
> > > > > > +
> > > > > > +/** struct xe_pagefault_queue: Xe pagefault queue (consumer)
> > > > > > */
> > > > > > +struct xe_pagefault_queue {
> > > > > > +       /**
> > > > > > +        * @data: Data in queue containing struct
> > > > > > xe_pagefault,
> > > > > > protected by
> > > > > > +        * @lock
> > > > > > +        */
> > > > > > +       void *data;
> > > > > > +       /** @size: Size of queue in bytes */
> > > > > > +       u32 size;
> > > > > > +       /** @head: Head pointer in bytes, moved by producer,
> > > > > > protected by @lock */
> > > > > > +       u32 head;
> > > > > > +       /** @tail: Tail pointer in bytes, moved by consumer,
> > > > > > protected by @lock */
> > > > > > +       u32 tail;
> > > > > > +       /** @lock: protects page fault queue */
> > > > > > +       spinlock_t lock;
> > > > > > +       /** @worker: to process page faults */
> > > > > > +       struct work_struct worker;
> > > > > > +};
> > > > > > +
> > > > > > +#endif
> > > > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 04/11] drm/xe: Implement xe_pagefault_handler
  2025-08-06  6:22 ` [PATCH 04/11] drm/xe: Implement xe_pagefault_handler Matthew Brost
  2025-08-28 11:26   ` Francois Dugast
@ 2025-08-28 20:24   ` Summers, Stuart
  1 sibling, 0 replies; 51+ messages in thread
From: Summers, Stuart @ 2025-08-28 20:24 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost,  Matthew
  Cc: Mrozek, Michal, Ghimiray, Himal Prasad,
	thomas.hellstrom@linux.intel.com, Dugast, Francois

On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> Enqueue (copy) the input struct xe_pagefault into a queue (i.e., into
> a
> memory buffer) and schedule a worker to service it.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_pagefault.c | 32
> +++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> b/drivers/gpu/drm/xe/xe_pagefault.c
> index aef389e51612..98be3203a9df 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -3,6 +3,8 @@
>   * Copyright © 2025 Intel Corporation
>   */
>  
> +#include <linux/circ_buf.h>
> +
>  #include <drm/drm_managed.h>
>  
>  #include "xe_device.h"
> @@ -156,6 +158,14 @@ void xe_pagefault_reset(struct xe_device *xe,
> struct xe_gt *gt)
>                 xe_pagefault_queue_reset(xe, gt, xe->usm.pf_queue +
> i);
>  }
>  
> +static bool xe_pagefault_queue_full(struct xe_pagefault_queue
> *pf_queue)
> +{
> +       lockdep_assert_held(&pf_queue->lock);
> +
> +       return CIRC_SPACE(pf_queue->head, pf_queue->tail, pf_queue-
> >size) <=
> +               xe_pagefault_entry_size();
> +}
> +
>  /**
>   * xe_pagefault_handler() - Page fault handler
>   * @xe: xe device instance
> @@ -168,6 +178,24 @@ void xe_pagefault_reset(struct xe_device *xe,
> struct xe_gt *gt)
>   */
>  int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault
> *pf)
>  {
> -       /* TODO - implement */
> -       return 0;
> +       struct xe_pagefault_queue *pf_queue = xe->usm.pf_queue +
> +               (pf->consumer.asid % XE_PAGEFAULT_QUEUE_COUNT);
> +       unsigned long flags;
> +       bool full;
> +
> +       spin_lock_irqsave(&pf_queue->lock, flags);
> +       full = xe_pagefault_queue_full(pf_queue);
> +       if (!full) {
> +               memcpy(pf_queue->data + pf_queue->head, pf,
> sizeof(*pf));
> +               pf_queue->head = (pf_queue->head +
> xe_pagefault_entry_size()) %
> +                       pf_queue->size;
> +               queue_work(xe->usm.pf_wq, &pf_queue->worker);
> +       } else {
> +               drm_warn(&xe->drm,
> +                        "PageFault Queue (%d) full, shouldn't be
> possible\n",

Actually, does it make sense to drop the "shouldn't be possible" part?
We already have this as a warn, and we now have a history of the sizing
calculations here. It could also be that some future hardware changes
this somehow and we want to know the calculation is no longer right, but
that would be a legitimate bug in that case (and likewise on existing
hardware if we have a bug in the calculation).

Thanks,
Stuart
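
As background for the full check above, a small sketch of the CIRC_SPACE()
convention it relies on, using illustrative values rather than the driver's
sizes. CIRC_SPACE() always keeps one byte unused to distinguish full from
empty, which is why the handler treats "space <= one entry" as full.

	#include <linux/circ_buf.h>
	#include <linux/types.h>

	/*
	 * With size = 1024 and 64-byte entries, head = 960 and tail = 0
	 * gives CIRC_SPACE(960, 0, 1024) == 63: the last 64 bytes are
	 * physically free, but one byte stays reserved, so enqueuing
	 * another entry must be refused and the queue reported full.
	 */
	static bool example_queue_full(u32 head, u32 tail, u32 size, u32 entry)
	{
		return CIRC_SPACE(head, tail, size) <= entry;
	}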

> +                        pf->consumer.asid %
> XE_PAGEFAULT_QUEUE_COUNT);
> +       }
> +       spin_unlock_irqrestore(&pf_queue->lock, flags);
> +
> +       return full ? -ENOSPC : 0;
>  }


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/11] drm/xe: Implement xe_pagefault_queue_work
  2025-08-06  6:22 ` [PATCH 05/11] drm/xe: Implement xe_pagefault_queue_work Matthew Brost
  2025-08-28 12:29   ` Francois Dugast
@ 2025-08-28 22:04   ` Summers, Stuart
  2025-08-29  0:51     ` Matthew Brost
  1 sibling, 1 reply; 51+ messages in thread
From: Summers, Stuart @ 2025-08-28 22:04 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost,  Matthew
  Cc: Mrozek, Michal, Ghimiray, Himal Prasad,
	thomas.hellstrom@linux.intel.com, Dugast, Francois

On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> Implement a worker that services page faults, using the same
> implementation as in xe_gt_pagefault.c.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_pagefault.c | 240
> +++++++++++++++++++++++++++++-
>  1 file changed, 239 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> b/drivers/gpu/drm/xe/xe_pagefault.c
> index 98be3203a9df..474412c21ec3 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -5,12 +5,20 @@
>  
>  #include <linux/circ_buf.h>
>  
> +#include <drm/drm_exec.h>
>  #include <drm/drm_managed.h>
>  
> +#include "xe_bo.h"
>  #include "xe_device.h"
> +#include "xe_gt_printk.h"
>  #include "xe_gt_types.h"
> +#include "xe_gt_stats.h"
> +#include "xe_hw_engine.h"
>  #include "xe_pagefault.h"
>  #include "xe_pagefault_types.h"
> +#include "xe_svm.h"
> +#include "xe_trace_bo.h"
> +#include "xe_vm.h"
>  
>  /**
>   * DOC: Xe page faults
> @@ -30,9 +38,239 @@ static int xe_pagefault_entry_size(void)
>         return roundup_pow_of_two(sizeof(struct xe_pagefault));
>  }
>  
> +static int xe_pagefault_begin(struct drm_exec *exec, struct xe_vma
> *vma,
> +                             bool atomic, unsigned int id)
> +{
> +       struct xe_bo *bo = xe_vma_bo(vma);
> +       struct xe_vm *vm = xe_vma_vm(vma);
> +       int err;
> +
> +       err = xe_vm_lock_vma(exec, vma);
> +       if (err)
> +               return err;
> +
> +       if (atomic && IS_DGFX(vm->xe)) {
> +               if (xe_vma_is_userptr(vma)) {
> +                       err = -EACCES;
> +                       return err;
> +               }
> +
> +               /* Migrate to VRAM, move should invalidate the VMA
> first */
> +               err = xe_bo_migrate(bo, XE_PL_VRAM0 + id);
> +               if (err)
> +                       return err;
> +       } else if (bo) {
> +               /* Create backing store if needed */
> +               err = xe_bo_validate(bo, vm, true);
> +               if (err)
> +                       return err;
> +       }
> +
> +       return 0;
> +}
> +
> +static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma
> *vma,
> +                                  bool atomic)
> +{
> +       struct xe_vm *vm = xe_vma_vm(vma);
> +       struct xe_tile *tile = gt_to_tile(gt);
> +       struct drm_exec exec;
> +       struct dma_fence *fence;
> +       ktime_t end = 0;
> +       int err;
> +
> +       lockdep_assert_held_write(&vm->lock);
> +
> +       xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_COUNT, 1);
> +       xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_KB,
> +                        xe_vma_size(vma) / SZ_1K);
> +
> +       trace_xe_vma_pagefault(vma);
> +
> +       /* Check if VMA is valid, opportunistic check only */
> +       if (xe_vm_has_valid_gpu_mapping(tile, vma->tile_present,
> +                                       vma->tile_invalidated) &&
> !atomic)
> +               return 0;
> +
> +retry_userptr:
> +       if (xe_vma_is_userptr(vma) &&
> +           xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
> +               struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> +
> +               err = xe_vma_userptr_pin_pages(uvma);
> +               if (err)
> +                       return err;
> +       }
> +
> +       /* Lock VM and BOs dma-resv */
> +       drm_exec_init(&exec, 0, 0);
> +       drm_exec_until_all_locked(&exec) {
> +               err = xe_pagefault_begin(&exec, vma, atomic, tile-
> >id);
> +               drm_exec_retry_on_contention(&exec);
> +               if (xe_vm_validate_should_retry(&exec, err, &end))
> +                       err = -EAGAIN;
> +               if (err)
> +                       goto unlock_dma_resv;
> +
> +               /* Bind VMA only to the GT that has faulted */
> +               trace_xe_vma_pf_bind(vma);
> +               fence = xe_vma_rebind(vm, vma, BIT(tile->id));
> +               if (IS_ERR(fence)) {
> +                       err = PTR_ERR(fence);
> +                       if (xe_vm_validate_should_retry(&exec, err,
> &end))
> +                               err = -EAGAIN;
> +                       goto unlock_dma_resv;
> +               }
> +       }
> +
> +       dma_fence_wait(fence, false);
> +       dma_fence_put(fence);
> +
> +unlock_dma_resv:
> +       drm_exec_fini(&exec);
> +       if (err == -EAGAIN)
> +               goto retry_userptr;
> +
> +       return err;
> +}
> +
> +static bool
> +xe_pagefault_access_is_atomic(enum xe_pagefault_access_type
> access_type)
> +{
> +       return access_type == XE_PAGEFAULT_ACCESS_TYPE_ATOMIC;
> +}
> +
> +static struct xe_vm *xe_pagefault_asid_to_vm(struct xe_device *xe,
> u32 asid)
> +{
> +       struct xe_vm *vm;
> +
> +       down_read(&xe->usm.lock);
> +       vm = xa_load(&xe->usm.asid_to_vm, asid);
> +       if (vm && xe_vm_in_fault_mode(vm))
> +               xe_vm_get(vm);
> +       else
> +               vm = ERR_PTR(-EINVAL);
> +       up_read(&xe->usm.lock);
> +
> +       return vm;
> +}
> +
> +static int xe_pagefault_service(struct xe_pagefault *pf)
> +{
> +       struct xe_gt *gt = pf->gt;
> +       struct xe_device *xe = gt_to_xe(gt);
> +       struct xe_vm *vm;
> +       struct xe_vma *vma = NULL;
> +       int err;
> +       bool atomic;
> +
> +       /* Producer flagged this fault to be nacked */
> +       if (pf->consumer.fault_level == XE_PAGEFAULT_LEVEL_NACK)
> +               return -EFAULT;
> +
> +       vm = xe_pagefault_asid_to_vm(xe, pf->consumer.asid);
> +       if (IS_ERR(vm))
> +               return PTR_ERR(vm);
> +
> +       /*
> +        * TODO: Change to read lock? Using write lock for
> simplicity.
> +        */
> +       down_write(&vm->lock);
> +
> +       if (xe_vm_is_closed(vm)) {
> +               err = -ENOENT;
> +               goto unlock_vm;
> +       }
> +
> +       vma = xe_vm_find_vma_by_addr(vm, pf->consumer.page_addr);
> +       if (!vma) {
> +               err = -EINVAL;
> +               goto unlock_vm;
> +       }
> +
> +       atomic = xe_pagefault_access_is_atomic(pf-
> >consumer.access_type);
> +
> +       if (xe_vma_is_cpu_addr_mirror(vma))
> +               err = xe_svm_handle_pagefault(vm, vma, gt,
> +                                             pf->consumer.page_addr,
> atomic);
> +       else
> +               err = xe_pagefault_handle_vma(gt, vma, atomic);
> +
> +unlock_vm:
> +       if (!err)
> +               vm->usm.last_fault_vma = vma;
> +       up_write(&vm->lock);
> +       xe_vm_put(vm);
> +
> +       return err;
> +}
> +
> +static bool xe_pagefault_queue_pop(struct xe_pagefault_queue
> *pf_queue,
> +                                  struct xe_pagefault *pf)
> +{
> +       bool found_fault = false;
> +
> +       spin_lock_irq(&pf_queue->lock);
> +       if (pf_queue->tail != pf_queue->head) {
> +               memcpy(pf, pf_queue->data + pf_queue->tail,
> sizeof(*pf));
> +               pf_queue->tail = (pf_queue->tail +
> xe_pagefault_entry_size()) %
> +                       pf_queue->size;
> +               found_fault = true;
> +       }
> +       spin_unlock_irq(&pf_queue->lock);
> +
> +       return found_fault;
> +}
> +
> +static void xe_pagefault_print(struct xe_pagefault *pf)
> +{
> +       xe_gt_dbg(pf->gt, "\n\tASID: %d\n"
> +                 "\tFaulted Address: 0x%08x%08x\n"
> +                 "\tFaultType: %d\n"
> +                 "\tAccessType: %d\n"
> +                 "\tFaultLevel: %d\n"
> +                 "\tEngineClass: %d %s\n",
> +                 pf->consumer.asid,
> +                 upper_32_bits(pf->consumer.page_addr),
> +                 lower_32_bits(pf->consumer.page_addr),
> +                 pf->consumer.fault_type,
> +                 pf->consumer.access_type,
> +                 pf->consumer.fault_level,
> +                 pf->consumer.engine_class,
> +                 xe_hw_engine_class_to_str(pf-
> >consumer.engine_class));
> +}
> +
>  static void xe_pagefault_queue_work(struct work_struct *w)
>  {
> -       /* TODO: Implement */
> +       struct xe_pagefault_queue *pf_queue =
> +               container_of(w, typeof(*pf_queue), worker);
> +       struct xe_pagefault pf;
> +       unsigned long threshold;
> +
> +#define USM_QUEUE_MAX_RUNTIME_MS      20
> +       threshold = jiffies +
> msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
> +
> +       while (xe_pagefault_queue_pop(pf_queue, &pf)) {
> +               int err;
> +
> +               if (!pf.gt)     /* Fault squashed during reset */
> +                       continue;
> +
> +               err = xe_pagefault_service(&pf);
> +               if (err) {
> +                       xe_pagefault_print(&pf);

I realize you're just copying over the existing functionality here.
Since we're in here already, though, should we change this to an info and
use dbg to just print all incoming faults?

Anyway, even if we do want to do this, it's not really applicable here
since this is just a copy as mentioned.

With the changes Francois suggested in place:
Reviewed-by: Stuart Summers <stuart.summers@intel.com>

Thanks,
Stuart
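
For readers skimming the worker above, the 20 ms budget follows a common
bounded-worker pattern: drain items until a jiffies threshold is reached, then
requeue the same work item so other work on the queue gets a turn. A
stripped-down sketch with illustrative names, not the driver code:

	#include <linux/jiffies.h>
	#include <linux/workqueue.h>

	struct example_queue {
		struct workqueue_struct *wq;
		struct work_struct worker;
		/* ring buffer state elided */
	};

	/* Returns true while items remain; body elided for brevity. */
	static bool example_pop(struct example_queue *q, void *item);

	static void example_worker(struct work_struct *w)
	{
		struct example_queue *q = container_of(w, struct example_queue, worker);
		unsigned long threshold = jiffies + msecs_to_jiffies(20);
		char item[64];

		while (example_pop(q, item)) {
			/* service one item here */

			if (time_after(jiffies, threshold)) {
				/* yield the worker; remaining items run on requeue */
				queue_work(q->wq, w);
				break;
			}
		}
	}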

> +                       xe_gt_dbg(pf.gt, "Fault response:
> Unsuccessful %pe\n",
> +                                 ERR_PTR(err));
> +               }
> +
> +               pf.producer.ops->ack_fault(&pf, err);
> +
> +               if (time_after(jiffies, threshold)) {
> +                       queue_work(gt_to_xe(pf.gt)->usm.pf_wq, w);
> +                       break;
> +               }
> +       }
> +#undef USM_QUEUE_MAX_RUNTIME_MS
>  }
>  
>  static int xe_pagefault_queue_init(struct xe_device *xe,


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/11] drm/xe: Add xe_guc_pagefault layer
  2025-08-06  6:22 ` [PATCH 06/11] drm/xe: Add xe_guc_pagefault layer Matthew Brost
  2025-08-28 13:27   ` Francois Dugast
@ 2025-08-28 22:11   ` Summers, Stuart
  2025-08-29  0:54     ` Matthew Brost
  1 sibling, 1 reply; 51+ messages in thread
From: Summers, Stuart @ 2025-08-28 22:11 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost,  Matthew
  Cc: Mrozek, Michal, Ghimiray, Himal Prasad,
	thomas.hellstrom@linux.intel.com, Dugast, Francois

On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> Add xe_guc_pagefault layer (producer) which parses G2H fault messages
> into struct xe_pagefault, forwards them to the page fault
> layer
> (consumer) for servicing, and provides a vfunc to acknowledge faults
> to
> the GuC upon completion. Replace the old (and incorrect) GT page
> fault
> layer with this new layer throughout the driver.
> 
> Signed-off-bt: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/Makefile           |  2 +-
>  drivers/gpu/drm/xe/xe_gt.c            |  6 --
>  drivers/gpu/drm/xe/xe_guc_ct.c        |  6 +-
>  drivers/gpu/drm/xe/xe_guc_pagefault.c | 94
> +++++++++++++++++++++++++++
>  drivers/gpu/drm/xe/xe_guc_pagefault.h | 13 ++++
>  drivers/gpu/drm/xe/xe_svm.c           |  3 +-
>  drivers/gpu/drm/xe/xe_vm.c            |  1 -
>  7 files changed, 110 insertions(+), 15 deletions(-)
>  create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.c
>  create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.h
> 
> diff --git a/drivers/gpu/drm/xe/Makefile
> b/drivers/gpu/drm/xe/Makefile
> index 6fbebafe79c9..c103c114b75c 100644
> --- a/drivers/gpu/drm/xe/Makefile
> +++ b/drivers/gpu/drm/xe/Makefile
> @@ -58,7 +58,6 @@ xe-y += xe_bb.o \
>         xe_gt_freq.o \
>         xe_gt_idle.o \
>         xe_gt_mcr.o \
> -       xe_gt_pagefault.o \
>         xe_gt_sysfs.o \
>         xe_gt_throttle.o \
>         xe_gt_tlb_invalidation.o \
> @@ -75,6 +74,7 @@ xe-y += xe_bb.o \
>         xe_guc_id_mgr.o \
>         xe_guc_klv_helpers.o \
>         xe_guc_log.o \
> +       xe_guc_pagefault.o \
>         xe_guc_pc.o \
>         xe_guc_submit.o \
>         xe_heci_gsc.o \
> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> index 5aa03f89a062..35c7ba7828a6 100644
> --- a/drivers/gpu/drm/xe/xe_gt.c
> +++ b/drivers/gpu/drm/xe/xe_gt.c
> @@ -32,7 +32,6 @@
>  #include "xe_gt_freq.h"
>  #include "xe_gt_idle.h"
>  #include "xe_gt_mcr.h"
> -#include "xe_gt_pagefault.h"
>  #include "xe_gt_printk.h"
>  #include "xe_gt_sriov_pf.h"
>  #include "xe_gt_sriov_vf.h"
> @@ -634,10 +633,6 @@ int xe_gt_init(struct xe_gt *gt)
>         if (err)
>                 return err;
>  
> -       err = xe_gt_pagefault_init(gt);
> -       if (err)
> -               return err;
> -
>         err = xe_gt_idle_init(&gt->gtidle);
>         if (err)
>                 return err;
> @@ -848,7 +843,6 @@ static int gt_reset(struct xe_gt *gt)
>         xe_uc_gucrc_disable(&gt->uc);
>         xe_uc_stop_prepare(&gt->uc);
>         xe_pagefault_reset(gt_to_xe(gt), gt);
> -       xe_gt_pagefault_reset(gt);
>  
>         xe_uc_stop(&gt->uc);
>  
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c
> b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 3f4e6a46ff16..67b5dd182207 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -21,7 +21,6 @@
>  #include "xe_devcoredump.h"
>  #include "xe_device.h"
>  #include "xe_gt.h"
> -#include "xe_gt_pagefault.h"
>  #include "xe_gt_printk.h"
>  #include "xe_gt_sriov_pf_control.h"
>  #include "xe_gt_sriov_pf_monitor.h"
> @@ -29,6 +28,7 @@
>  #include "xe_gt_tlb_invalidation.h"
>  #include "xe_guc.h"
>  #include "xe_guc_log.h"
> +#include "xe_guc_pagefault.h"
>  #include "xe_guc_relay.h"
>  #include "xe_guc_submit.h"
>  #include "xe_map.h"
> @@ -1419,10 +1419,6 @@ static int process_g2h_msg(struct xe_guc_ct
> *ct, u32 *msg, u32 len)
>                 ret = xe_guc_tlb_invalidation_done_handler(guc,
> payload,
>                                                            adj_len);
>                 break;
> -       case XE_GUC_ACTION_ACCESS_COUNTER_NOTIFY:
> -               ret = xe_guc_access_counter_notify_handler(guc,
> payload,
> -                                                          adj_len);
> -               break;
>         case XE_GUC_ACTION_GUC2PF_RELAY_FROM_VF:
>                 ret = xe_guc_relay_process_guc2pf(&guc->relay, hxg,
> hxg_len);
>                 break;
> diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.c
> b/drivers/gpu/drm/xe/xe_guc_pagefault.c
> new file mode 100644
> index 000000000000..0aa069d2a581
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_guc_pagefault.c
> @@ -0,0 +1,94 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#include "abi/guc_actions_abi.h"
> +#include "xe_guc.h"
> +#include "xe_guc_ct.h"
> +#include "xe_guc_pagefault.h"
> +#include "xe_pagefault.h"
> +
> +static void guc_ack_fault(struct xe_pagefault *pf, int err)
> +{
> +       u32 vfid = FIELD_GET(PFD_VFID, pf->producer.msg[2]);
> +       u32 engine_instance = FIELD_GET(PFD_ENG_INSTANCE, pf-
> >producer.msg[0]);
> +       u32 engine_class = FIELD_GET(PFD_ENG_CLASS, pf-
> >producer.msg[0]);
> +       u32 pdata = FIELD_GET(PFD_PDATA_LO, pf->producer.msg[0]) |
> +               (FIELD_GET(PFD_PDATA_HI, pf->producer.msg[1]) <<
> +                PFD_PDATA_HI_SHIFT);
> +       u32 action[] = {
> +               XE_GUC_ACTION_PAGE_FAULT_RES_DESC,
> +
> +               FIELD_PREP(PFR_VALID, 1) |
> +               FIELD_PREP(PFR_SUCCESS, !!err) |
> +               FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
> +               FIELD_PREP(PFR_DESC_TYPE, FAULT_RESPONSE_DESC) |
> +               FIELD_PREP(PFR_ASID, pf->consumer.asid),
> +
> +               FIELD_PREP(PFR_VFID, vfid) |
> +               FIELD_PREP(PFR_ENG_INSTANCE, engine_instance) |
> +               FIELD_PREP(PFR_ENG_CLASS, engine_class) |
> +               FIELD_PREP(PFR_PDATA, pdata),
> +       };
> +       struct xe_guc *guc = pf->producer.private;
> +
> +       xe_guc_ct_send(&guc->ct, action, ARRAY_SIZE(action), 0, 0);
> +}
> +
> +static const struct xe_pagefault_ops guc_pagefault_ops = {
> +       .ack_fault = guc_ack_fault,
> +};
> +
> +/**
> + * xe_guc_pagefault_handler() - G2H page fault handler
> + * @guc: GuC object
> + * @msg: G2H message
> + * @len: Length of G2H message
> + *
> + * Parse GuC to host (G2H) message into a struct xe_pagefault and
> forward onto
> + * the Xe page fault layer.
> + *
> + * Return: 0 on success, errno on failure
> + */
> +int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len)
> +{
> +       struct xe_pagefault pf;
> +       int i;
> +
> +#define GUC_PF_MSG_LEN_DW      \
> +       (sizeof(struct xe_guc_pagefault_desc) / sizeof(u32))
> +
> +       BUILD_BUG_ON(GUC_PF_MSG_LEN_DW >
> XE_PAGEFAULT_PRODUCER_MSG_LEN_DW);
> +
> +       if (len != GUC_PF_MSG_LEN_DW)
> +               return -EPROTO;
> +
> +       pf.gt = guc_to_gt(guc);
> +
> +       /*
> +        * XXX: These values happen to match the enum in
> xe_pagefault_types.h.
> +        * If that changes, we’ll need to remap them here.
> +        */
> +       pf.consumer.page_addr = (u64)(FIELD_GET(PFD_VIRTUAL_ADDR_HI,
> msg[3])
> +                                     << PFD_VIRTUAL_ADDR_HI_SHIFT) |
> +               (FIELD_GET(PFD_VIRTUAL_ADDR_LO, msg[2]) <<
> +                PFD_VIRTUAL_ADDR_LO_SHIFT);
> +       pf.consumer.asid = FIELD_GET(PFD_ASID, msg[1]);
> +       pf.consumer.access_type = FIELD_GET(PFD_ACCESS_TYPE,
> msg[2]);;
> +       pf.consumer.fault_type = FIELD_GET(PFD_FAULT_TYPE, msg[2]);
> +       if (FIELD_GET(XE2_PFD_TRVA_FAULT, msg[0]))
> +               pf.consumer.fault_level = XE_PAGEFAULT_LEVEL_NACK;

We have a comment in the current implementation that says "sw isn't
expected to handle trtt faults". At a minimum it would be nice to keep
that here.

But really it would be nice to have a little documentation here as to
*why* we don't care about these types of faults. Should we print
something if this shows up, at least for debug?

> +       else
> +               pf.consumer.fault_level = FIELD_GET(PFD_FAULT_LEVEL,
> msg[0]);
> +       pf.consumer.engine_class = FIELD_GET(PFD_ENG_CLASS, msg[0]);

Again I think we should log the instance here as well.

Thanks,
Stuart


> +
> +       pf.producer.private = guc;
> +       pf.producer.ops = &guc_pagefault_ops;
> +       for (i = 0; i < GUC_PF_MSG_LEN_DW; ++i)
> +               pf.producer.msg[i] = msg[i];
> +
> +#undef GUC_PF_MSG_LEN_DW
> +
> +       return xe_pagefault_handler(guc_to_xe(guc), &pf);
> +}
> diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.h
> b/drivers/gpu/drm/xe/xe_guc_pagefault.h
> new file mode 100644
> index 000000000000..0723f57b8ea9
> --- /dev/null
> +++ b/drivers/gpu/drm/xe/xe_guc_pagefault.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2025 Intel Corporation
> + */
> +
> +#ifndef _XE_GUC_PAGEFAULT_H_
> +#define _XE_GUC_PAGEFAULT_H_
> +
> +#include <linux/types.h>
> +
> +int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len);
> +
> +#endif
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 10c8a1bcb86e..1bcf3ba3b350 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -109,8 +109,7 @@ xe_svm_garbage_collector_add_range(struct xe_vm
> *vm, struct xe_svm_range *range,
>                               &vm->svm.garbage_collector.range_list);
>         spin_unlock(&vm->svm.garbage_collector.lock);
>  
> -       queue_work(xe_device_get_root_tile(xe)->primary_gt-
> >usm.pf_wq,
> -                  &vm->svm.garbage_collector.work);
> +       queue_work(xe->usm.pf_wq, &vm->svm.garbage_collector.work);
>  }
>  
>  static u8
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 432ea325677d..c9ae13c32117 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -27,7 +27,6 @@
>  #include "xe_device.h"
>  #include "xe_drm_client.h"
>  #include "xe_exec_queue.h"
> -#include "xe_gt_pagefault.h"
>  #include "xe_gt_tlb_invalidation.h"
>  #include "xe_migrate.h"
>  #include "xe_pat.h"


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/11] drm/xe: Thread prefetch of SVM ranges
  2025-08-06  6:22 ` [PATCH 10/11] drm/xe: Thread prefetch of SVM ranges Matthew Brost
@ 2025-08-28 22:55   ` Summers, Stuart
  2025-08-29  1:06     ` Matthew Brost
  0 siblings, 1 reply; 51+ messages in thread
From: Summers, Stuart @ 2025-08-28 22:55 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost,  Matthew
  Cc: Mrozek, Michal, Ghimiray, Himal Prasad,
	thomas.hellstrom@linux.intel.com, Dugast, Francois

On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> The migrate_vma_* functions are very CPU-intensive; as a result,
> prefetching SVM ranges is limited by CPU performance rather than
> paging
> copy engine bandwidth. To accelerate SVM range prefetching, the step
> that calls migrate_vma_* is now threaded. Reuses the page fault work
> queue for threading.
> 
> Running xe_exec_system_allocator --r prefetch-benchmark, which tests
> 64MB prefetches, shows an increase from ~4.35 GB/s to 12.25 GB/s with
> this patch on drm-tip. Enabling high SLPC further increases
> throughput
> to ~15.25 GB/s, and combining SLPC with ULLS raises it to ~16 GB/s.
> Both
> of these optimizations are upcoming.
> 
> v2:
>  - Use dedicated prefetch workqueue
>  - Pick dedicated prefetch thread count based on profiling
>  - Skip threaded prefetch for only 1 range or if prefetching to SRAM
>  - Fully tested
> v3:
>  - Use page fault work queue
> 
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_pagefault.c |  30 ++++++-
>  drivers/gpu/drm/xe/xe_svm.c       |  17 +++-
>  drivers/gpu/drm/xe/xe_vm.c        | 144 +++++++++++++++++++++++-----
> --
>  3 files changed, 152 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> b/drivers/gpu/drm/xe/xe_pagefault.c
> index 95d2eb8566fb..f11c70ca6dd9 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -177,7 +177,17 @@ static int xe_pagefault_service(struct
> xe_pagefault *pf)
>         if (IS_ERR(vm))
>                 return PTR_ERR(vm);
>  
> -       down_read(&vm->lock);
> +       /*
> +        * We can't block threaded prefetches from completing.
> down_read() can
> +        * block on a pending down_write(), so without a trylock
> here, we could
> +        * deadlock, since the page fault workqueue is shared with
> prefetches,
> +        * prefetches flush work items onto the same workqueue, and a
> +        * down_write() could be pending.
> +        */
> +       if (!down_read_trylock(&vm->lock)) {
> +               err = -EAGAIN;
> +               goto put_vm;
> +       }
>  
>         if (xe_vm_is_closed(vm)) {
>                 err = -ENOENT;
> @@ -202,11 +212,23 @@ static int xe_pagefault_service(struct
> xe_pagefault *pf)
>         if (!err)
>                 vm->usm.last_fault_vma = vma;
>         up_read(&vm->lock);
> +put_vm:
>         xe_vm_put(vm);
>  
>         return err;
>  }
>  
> +static void xe_pagefault_queue_retry(struct xe_pagefault_queue
> *pf_queue,
> +                                    struct xe_pagefault *pf)
> +{
> +       spin_lock_irq(&pf_queue->lock);
> +       if (!pf_queue->tail)
> +               pf_queue->tail = pf_queue->size -
> xe_pagefault_entry_size();
> +       else
> +               pf_queue->tail -= xe_pagefault_entry_size();
> +       spin_unlock_irq(&pf_queue->lock);
> +}
> +
>  static bool xe_pagefault_queue_pop(struct xe_pagefault_queue
> *pf_queue,
>                                    struct xe_pagefault *pf)
>  {
> @@ -259,7 +281,11 @@ static void xe_pagefault_queue_work(struct
> work_struct *w)
>                         continue;
>  
>                 err = xe_pagefault_service(&pf);
> -               if (err) {
> +               if (err == -EAGAIN) {
> +                       xe_pagefault_queue_retry(pf_queue, &pf);
> +                       queue_work(gt_to_xe(pf.gt)->usm.pf_wq, w);
> +                       break;
> +               } else if (err) {
>                         xe_pagefault_print(&pf);
>                         xe_gt_dbg(pf.gt, "Fault response:
> Unsuccessful %pe\n",
>                                   ERR_PTR(err));
> diff --git a/drivers/gpu/drm/xe/xe_svm.c
> b/drivers/gpu/drm/xe/xe_svm.c
> index 6e5d9ce7c76e..069ede2c7991 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -306,8 +306,19 @@ static void
> xe_svm_garbage_collector_work_func(struct work_struct *w)
>         struct xe_vm *vm = container_of(w, struct xe_vm,
>                                         svm.garbage_collector.work);
>  
> -       guard(rwsem_read)(&vm->lock);
> -       xe_svm_garbage_collector(vm);
> +       /*
> +        * We can't block threaded prefetches from completing.
> down_read() can
> +        * block on a pending down_write(), so without a trylock
> here, we could
> +        * deadlock, since the page fault workqueue is shared with
> prefetches,
> +        * prefetches flush work items onto the same workqueue, and a
> +        * down_write() could be pending.
> +        */
> +       if (down_read_trylock(&vm->lock)) {
> +               xe_svm_garbage_collector(vm);
> +               up_read(&vm->lock);
> +       } else {
> +               queue_work(vm->xe->usm.pf_wq, &vm-
> >svm.garbage_collector.work);
> +       }
>  }
>  
>  #if IS_ENABLED(CONFIG_DRM_XE_PAGEMAP)
> @@ -1148,5 +1159,5 @@ int xe_devm_add(struct xe_tile *tile, struct
> xe_vram_region *vr)
>  void xe_svm_flush(struct xe_vm *vm)
>  {
>         if (xe_vm_in_fault_mode(vm))
> -               flush_work(&vm->svm.garbage_collector.work);
> +               __flush_workqueue(vm->xe->usm.pf_wq);
>  }
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 3211827ef6d7..147b900b1f0b 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -2962,57 +2962,132 @@ static int check_ufence(struct xe_vma *vma)
>         return 0;
>  }
>  
> -static int prefetch_ranges(struct xe_vm *vm, struct xe_vma_op *op)
> +struct prefetch_thread {
> +       struct work_struct work;
> +       struct drm_gpusvm_ctx *ctx;
> +       struct xe_vma *vma;
> +       struct xe_svm_range *svm_range;
> +       struct xe_tile *tile;
> +       u32 region;
> +       int err;
> +};
> +
> +static void prefetch_thread_func(struct prefetch_thread *thread)
>  {
> -       bool devmem_possible = IS_DGFX(vm->xe) &&
> IS_ENABLED(CONFIG_DRM_XE_PAGEMAP);
> -       struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
> +       struct xe_vma *vma = thread->vma;
> +       struct xe_vm *vm = xe_vma_vm(vma);
> +       struct xe_svm_range *svm_range = thread->svm_range;
> +       u32 region = thread->region;
> +       struct xe_tile *tile = thread->tile;
>         int err = 0;
>  
> -       struct xe_svm_range *svm_range;
> +       guard(mutex)(&svm_range->lock);
> +
> +       if (xe_svm_range_is_removed(svm_range)) {
> +               thread->err = -ENODATA;
> +               return;
> +       }
> +
> +       if (!region) {
> +               xe_svm_range_migrate_to_smem(vm, svm_range);
> +       } else if (xe_svm_range_needs_migrate_to_vram(svm_range, vma,
> region)) {
> +               err = xe_svm_alloc_vram(tile, svm_range, thread-
> >ctx);
> +               if (err) {
> +                       drm_dbg(&vm->xe->drm,
> +                               "VRAM allocation failed, retry from
> userspace, asid=%u, gpusvm=%p, errno=%pe\n",
> +                               vm->usm.asid, &vm->svm.gpusvm,
> ERR_PTR(err));
> +                       thread->err = -ENODATA;
> +                       return;
> +               }
> +               xe_svm_range_debug(svm_range, "PREFETCH - RANGE
> MIGRATED TO VRAM");
> +       }
> +
> +       err = xe_svm_range_get_pages(vm, svm_range, thread->ctx);
> +       if (err) {
> +               drm_dbg(&vm->xe->drm, "Get pages failed, asid=%u,
> gpusvm=%p, errno=%pe\n",
> +                       vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err));
> +               if (err == -EOPNOTSUPP || err == -EFAULT || err == -
> EPERM)
> +                       err = -ENODATA;
> +               thread->err = err;
> +               return;
> +       }
> +
> +       xe_svm_range_debug(svm_range, "PREFETCH - RANGE GET PAGES
> DONE");
> +}
> +
> +static void prefetch_work_func(struct work_struct *w)
> +{
> +       struct prefetch_thread *thread =
> +               container_of(w, struct prefetch_thread, work);
> +
> +       prefetch_thread_func(thread);
> +}
> +
> +static int prefetch_ranges(struct xe_vm *vm, struct xe_vma_ops
> *vops,
> +                          struct xe_vma_op *op)
> +{
> +       struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
> +       u32 region = op->prefetch_range.region;
>         struct drm_gpusvm_ctx ctx = {};
> -       struct xe_tile *tile;
> +       struct prefetch_thread stack_thread;
> +       struct xe_svm_range *svm_range;
> +       struct prefetch_thread *prefetches;
> +       bool sram = region_to_mem_type[region] == XE_PL_TT;
> +       struct xe_tile *tile = sram ? xe_device_get_root_tile(vm->xe)
> :
> +               &vm->xe->tiles[region_to_mem_type[region] -
> XE_PL_VRAM0];
>         unsigned long i;
> -       u32 region;
> +       bool devmem_possible = IS_DGFX(vm->xe) &&
> +               IS_ENABLED(CONFIG_DRM_XE_PAGEMAP);
> +       bool skip_threads = op->prefetch_range.ranges_count == 1 ||
> sram ||

Starting to work through these... shouldn't we also allow the user to
opportunistically skip this (cgroup/sysfs/etc)? I realize the
microbenchmark shows some improvement, but some of the workloads might
also be much heavier on the CPU side and we don't want to throttle
that with the extra kernel threads if they aren't heavy on the fault
side.

Thanks,
Stuart
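
For context on the threading being discussed, the diff below follows a
fan-out/join shape: one work item per SVM range is queued on the page fault
workqueue, then each is flushed to join and an error is collected. A
stripped-down sketch with illustrative names, not the driver code, and with
the per-range work body elided:

	#include <linux/slab.h>
	#include <linux/workqueue.h>

	struct fanout_job {
		struct work_struct work;
		int err;
		/* per-range inputs elided */
	};

	static void fanout_fn(struct work_struct *w)
	{
		struct fanout_job *job = container_of(w, struct fanout_job, work);

		job->err = 0;	/* the per-range migrate/get-pages step goes here */
	}

	static int fanout_and_join(struct workqueue_struct *wq, unsigned int count)
	{
		struct fanout_job *jobs;
		unsigned int i;
		int err = 0;

		jobs = kvmalloc_array(count, sizeof(*jobs), GFP_KERNEL);
		if (!jobs)
			return -ENOMEM;

		for (i = 0; i < count; i++) {
			INIT_WORK(&jobs[i].work, fanout_fn);
			jobs[i].err = 0;
			queue_work(wq, &jobs[i].work);
		}

		for (i = 0; i < count; i++) {
			flush_work(&jobs[i].work);	/* join */
			if (jobs[i].err && !err)
				err = jobs[i].err;
		}

		kvfree(jobs);
		return err;
	}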

> +               !(vops->flags & XE_VMA_OPS_FLAG_DOWNGRADE_LOCK);
> +       struct prefetch_thread *thread = skip_threads ? &stack_thread
> : NULL;
> +       int err = 0, idx = 0;
>  
>         if (!xe_vma_is_cpu_addr_mirror(vma))
>                 return 0;
>  
> -       region = op->prefetch_range.region;
> +       if (!skip_threads) {
> +               prefetches = kvmalloc_array(op-
> >prefetch_range.ranges_count,
> +                                           sizeof(*prefetches),
> GFP_KERNEL);
> +               if (!prefetches)
> +                       return -ENOMEM;
> +       }
>  
>         ctx.read_only = xe_vma_read_only(vma);
>         ctx.devmem_possible = devmem_possible;
>         ctx.check_pages_threshold = devmem_possible ? SZ_64K : 0;
>  
> -       /* TODO: Threading the migration */
>         xa_for_each(&op->prefetch_range.range, i, svm_range) {
> -               guard(mutex)(&svm_range->lock);
> -
> -               if (xe_svm_range_is_removed(svm_range))
> -                       return -ENODATA;
> -
> -               if (!region)
> -                       xe_svm_range_migrate_to_smem(vm, svm_range);
> +               if (!skip_threads) {
> +                       thread = prefetches + idx++;
> +                       INIT_WORK(&thread->work, prefetch_work_func);
> +               }
>  
> -               if (xe_svm_range_needs_migrate_to_vram(svm_range,
> vma, region)) {
> -                       tile = &vm->xe-
> >tiles[region_to_mem_type[region] - XE_PL_VRAM0];
> -                       err = xe_svm_alloc_vram(tile, svm_range,
> &ctx);
> -                       if (err) {
> -                               drm_dbg(&vm->xe->drm, "VRAM
> allocation failed, retry from userspace, asid=%u, gpusvm=%p,
> errno=%pe\n",
> -                                       vm->usm.asid, &vm-
> >svm.gpusvm, ERR_PTR(err));
> -                               return -ENODATA;
> -                       }
> -                       xe_svm_range_debug(svm_range, "PREFETCH -
> RANGE MIGRATED TO VRAM");
> +               thread->ctx = &ctx;
> +               thread->vma = vma;
> +               thread->svm_range = svm_range;
> +               thread->tile = tile;
> +               thread->region = region;
> +               thread->err = 0;
> +
> +               if (skip_threads) {
> +                       prefetch_thread_func(thread);
> +                       if (thread->err)
> +                               return thread->err;
> +               } else {
> +                       queue_work(vm->xe->usm.pf_wq, &thread->work);
>                 }
> +       }
>  
> -               err = xe_svm_range_get_pages(vm, svm_range, &ctx);
> -               if (err) {
> -                       drm_dbg(&vm->xe->drm, "Get pages failed,
> asid=%u, gpusvm=%p, errno=%pe\n",
> -                               vm->usm.asid, &vm->svm.gpusvm,
> ERR_PTR(err));
> -                       if (err == -EOPNOTSUPP || err == -EFAULT ||
> err == -EPERM)
> -                               err = -ENODATA;
> -                       return err;
> +       if (!skip_threads) {
> +               for (i = 0; i < idx; ++i) {
> +                       thread = prefetches + i;
> +
> +                       flush_work(&thread->work);
> +                       if (thread->err && (!err || err == -ENODATA))
> +                               err = thread->err;
>                 }
> -               xe_svm_range_debug(svm_range, "PREFETCH - RANGE GET
> PAGES DONE");
> +               kvfree(prefetches);
>         }
>  
>         return err;
> @@ -3079,7 +3154,8 @@ static int op_lock_and_prep(struct drm_exec
> *exec, struct xe_vm *vm,
>         return err;
>  }
>  
> -static int vm_bind_ioctl_ops_prefetch_ranges(struct xe_vm *vm,
> struct xe_vma_ops *vops)
> +static int vm_bind_ioctl_ops_prefetch_ranges(struct xe_vm *vm,
> +                                            struct xe_vma_ops *vops)
>  {
>         struct xe_vma_op *op;
>         int err;
> @@ -3089,7 +3165,7 @@ static int
> vm_bind_ioctl_ops_prefetch_ranges(struct xe_vm *vm, struct xe_vma_ops
>  
>         list_for_each_entry(op, &vops->list, link) {
>                 if (op->base.op  == DRM_GPUVA_OP_PREFETCH) {
> -                       err = prefetch_ranges(vm, op);
> +                       err = prefetch_ranges(vm, vops, op);
>                         if (err)
>                                 return err;
>                 }


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 11/11] drm/xe: Add num_pf_queue modparam
  2025-08-06  6:22 ` [PATCH 11/11] drm/xe: Add num_pf_queue modparam Matthew Brost
@ 2025-08-28 22:58   ` Summers, Stuart
  0 siblings, 0 replies; 51+ messages in thread
From: Summers, Stuart @ 2025-08-28 22:58 UTC (permalink / raw)
  To: intel-xe@lists.freedesktop.org, Brost,  Matthew
  Cc: Mrozek, Michal, Ghimiray, Himal Prasad,
	thomas.hellstrom@linux.intel.com, Dugast, Francois

On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> Enable a quick experiment to see how the number of page fault queues
> affects performance.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_device.c       | 15 +++++++++++++--
>  drivers/gpu/drm/xe/xe_device_types.h |  6 ++++--
>  drivers/gpu/drm/xe/xe_module.c       |  5 +++++
>  drivers/gpu/drm/xe/xe_module.h       |  1 +
>  drivers/gpu/drm/xe/xe_pagefault.c    |  8 ++++----
>  drivers/gpu/drm/xe/xe_vm.c           |  3 ++-
>  6 files changed, 29 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device.c
> b/drivers/gpu/drm/xe/xe_device.c
> index c7c8aee03841..47eb07e9c799 100644
> --- a/drivers/gpu/drm/xe/xe_device.c
> +++ b/drivers/gpu/drm/xe/xe_device.c
> @@ -413,6 +413,17 @@ static void xe_device_destroy(struct drm_device
> *dev, void *dummy)
>         ttm_device_fini(&xe->ttm);
>  }
>  
> +static void xe_device_parse_modparame(struct xe_device *xe)
> +{
> +       xe->info.force_execlist = xe_modparam.force_execlist;
> +       xe->info.num_pf_queue = xe_modparam.num_pf_queue;
> +       if (xe->info.num_pf_queue < 1)
> +               xe->info.num_pf_queue = 1;
> +       else if (xe->info.num_pf_queue > XE_PAGEFAULT_QUEUE_MAX)
> +               xe->info.num_pf_queue = XE_PAGEFAULT_QUEUE_MAX;
> +       xe->atomic_svm_timeslice_ms = 5;
> +}
> +
>  struct xe_device *xe_device_create(struct pci_dev *pdev,
>                                    const struct pci_device_id *ent)
>  {
> @@ -446,8 +457,8 @@ struct xe_device *xe_device_create(struct pci_dev
> *pdev,
>  
>         xe->info.devid = pdev->device;
>         xe->info.revid = pdev->revision;
> -       xe->info.force_execlist = xe_modparam.force_execlist;
> -       xe->atomic_svm_timeslice_ms = 5;
> +
> +       xe_device_parse_modparame(xe);
>  
>         err = xe_irq_init(xe);
>         if (err)
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h
> b/drivers/gpu/drm/xe/xe_device_types.h
> index 02b91a698500..d5c5fd7972a1 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -243,6 +243,8 @@ struct xe_device {
>                 u8 revid;
>                 /** @info.step: stepping information for each IP */
>                 struct xe_step_info step;
> +               /** @info.num_pf_queue: Number of page fault queues
> */
> +               int num_pf_queue;
>                 /** @info.dma_mask_size: DMA address bits */
>                 u8 dma_mask_size;
>                 /** @info.vram_flags: Vram flags */
> @@ -399,9 +401,9 @@ struct xe_device {
>                 struct rw_semaphore lock;
>                 /** @usm.pf_wq: page fault work queue, unbound, high
> priority */
>                 struct workqueue_struct *pf_wq;
> -#define XE_PAGEFAULT_QUEUE_COUNT       4
> +#define XE_PAGEFAULT_QUEUE_MAX 8
>                 /** @pf_queue: Page fault queues */
> -               struct xe_pagefault_queue
> pf_queue[XE_PAGEFAULT_QUEUE_COUNT];
> +               struct xe_pagefault_queue
> pf_queue[XE_PAGEFAULT_QUEUE_MAX];
>         } usm;
>  
>         /** @pinned: pinned BO state */
> diff --git a/drivers/gpu/drm/xe/xe_module.c
> b/drivers/gpu/drm/xe/xe_module.c
> index d08338fc3bc1..0671ae9d9e5a 100644
> --- a/drivers/gpu/drm/xe/xe_module.c
> +++ b/drivers/gpu/drm/xe/xe_module.c
> @@ -27,6 +27,7 @@
>  #define DEFAULT_PROBE_DISPLAY          true
>  #define DEFAULT_VRAM_BAR_SIZE          0
>  #define DEFAULT_FORCE_PROBE            CONFIG_DRM_XE_FORCE_PROBE
> +#define DEFAULT_NUM_PF_QUEUE           4
>  #define DEFAULT_MAX_VFS                        ~0
>  #define DEFAULT_MAX_VFS_STR            "unlimited"
>  #define DEFAULT_WEDGED_MODE            1
> @@ -40,6 +41,7 @@ struct xe_modparam xe_modparam = {
>         .max_vfs =              DEFAULT_MAX_VFS,
>  #endif
>         .wedged_mode =          DEFAULT_WEDGED_MODE,
> +       .num_pf_queue =         DEFAULT_NUM_PF_QUEUE,
>         .svm_notifier_size =    DEFAULT_SVM_NOTIFIER_SIZE,
>         /* the rest are 0 by default */
>  };
> @@ -93,6 +95,9 @@ MODULE_PARM_DESC(wedged_mode,
>                  "Module's default policy for the wedged mode
> (0=never, 1=upon-critical-errors, 2=upon-any-hang "
>                  "[default=" __stringify(DEFAULT_WEDGED_MODE) "])");
>  
> +module_param_named(num_pf_queue, xe_modparam.num_pf_queue, int,
> 0600);
> +MODULE_PARM_DESC(num_pf_queue, "Number of page fault queue,
> default=4, min=1, max=8");
> +
>  static int xe_check_nomodeset(void)
>  {
>         if (drm_firmware_drivers_only())
> diff --git a/drivers/gpu/drm/xe/xe_module.h
> b/drivers/gpu/drm/xe/xe_module.h
> index 5a3bfea8b7b4..36ac2151fe16 100644
> --- a/drivers/gpu/drm/xe/xe_module.h
> +++ b/drivers/gpu/drm/xe/xe_module.h
> @@ -22,6 +22,7 @@ struct xe_modparam {
>         unsigned int max_vfs;
>  #endif
>         int wedged_mode;
> +       int num_pf_queue;
>         u32 svm_notifier_size;
>  };
>  
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> b/drivers/gpu/drm/xe/xe_pagefault.c
> index f11c70ca6dd9..3c69557c6aa9 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -373,11 +373,11 @@ int xe_pagefault_init(struct xe_device *xe)
>  
>         xe->usm.pf_wq = alloc_workqueue("xe_page_fault_work_queue",
>                                         WQ_UNBOUND | WQ_HIGHPRI,
> -                                       XE_PAGEFAULT_QUEUE_COUNT);
> +                                       xe->info.num_pf_queue);
>         if (!xe->usm.pf_wq)
>                 return -ENOMEM;
>  
> -       for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i) {
> +       for (i = 0; i < xe->info.num_pf_queue; ++i) {
>                 err = xe_pagefault_queue_init(xe, xe->usm.pf_queue +
> i);
>                 if (err)
>                         goto err_out;
> @@ -420,7 +420,7 @@ void xe_pagefault_reset(struct xe_device *xe,
> struct xe_gt *gt)
>  {
>         int i;
>  
> -       for (i = 0; i < XE_PAGEFAULT_QUEUE_COUNT; ++i)
> +       for (i = 0; i < xe->info.num_pf_queue; ++i)
>                 xe_pagefault_queue_reset(xe, gt, xe->usm.pf_queue +
> i);
>  }
>  
> @@ -442,7 +442,7 @@ static int xe_pagefault_queue_index(struct
> xe_device *xe)
>  
>         WRITE_ONCE(xe->usm.current_pf_queue, (old_pf_queue + 1));
>  
> -       return old_pf_queue % XE_PAGEFAULT_QUEUE_COUNT;
> +       return old_pf_queue % xe->info.num_pf_queue;
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> index 147b900b1f0b..67000c4466ab 100644
> --- a/drivers/gpu/drm/xe/xe_vm.c
> +++ b/drivers/gpu/drm/xe/xe_vm.c
> @@ -3039,7 +3039,8 @@ static int prefetch_ranges(struct xe_vm *vm,
> struct xe_vma_ops *vops,
>         bool devmem_possible = IS_DGFX(vm->xe) &&
>                 IS_ENABLED(CONFIG_DRM_XE_PAGEMAP);
>         bool skip_threads = op->prefetch_range.ranges_count == 1 ||
> sram ||
> -               !(vops->flags & XE_VMA_OPS_FLAG_DOWNGRADE_LOCK);
> +               !(vops->flags & XE_VMA_OPS_FLAG_DOWNGRADE_LOCK) ||
> +               vm->xe->info.num_pf_queue == 1;

Ah ok, this does add that, but we still might want to skip threading even
with the default number of queues (4).

Also, should we make this a configfs attribute to allow users to configure
this per device for finer tuning? I understand you have this just for
local debug right now...
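
Or, as a stopgap, since the param is registered with mode 0600 it can
already be rewritten between probes - a rough sketch (0000:03:00.0 is just
an example address, and the value is only sampled in xe_device_create() at
probe time, so it has to be set before the device is bound):

    # hypothetical experiment flow, not a per-device knob
    echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/unbind
    echo 8 > /sys/module/xe/parameters/num_pf_queue
    echo 0000:03:00.0 > /sys/bus/pci/drivers/xe/bind

But that is still global, so configfs would be the nicer way to get real
per-device tuning.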

Thanks,
Stuart

>         struct prefetch_thread *thread = skip_threads ? &stack_thread
> : NULL;
>         int err = 0, idx = 0;
>  


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 05/11] drm/xe: Implement xe_pagefault_queue_work
  2025-08-28 22:04   ` Summers, Stuart
@ 2025-08-29  0:51     ` Matthew Brost
  0 siblings, 0 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-29  0:51 UTC (permalink / raw)
  To: Summers, Stuart
  Cc: intel-xe@lists.freedesktop.org, Mrozek, Michal,
	Ghimiray, Himal Prasad, thomas.hellstrom@linux.intel.com,
	Dugast, Francois

On Thu, Aug 28, 2025 at 04:04:13PM -0600, Summers, Stuart wrote:
> On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > Implement a worker that services page faults, using the same
> > implementation as in xe_gt_pagefault.c.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_pagefault.c | 240
> > +++++++++++++++++++++++++++++-
> >  1 file changed, 239 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > b/drivers/gpu/drm/xe/xe_pagefault.c
> > index 98be3203a9df..474412c21ec3 100644
> > --- a/drivers/gpu/drm/xe/xe_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > @@ -5,12 +5,20 @@
> >  
> >  #include <linux/circ_buf.h>
> >  
> > +#include <drm/drm_exec.h>
> >  #include <drm/drm_managed.h>
> >  
> > +#include "xe_bo.h"
> >  #include "xe_device.h"
> > +#include "xe_gt_printk.h"
> >  #include "xe_gt_types.h"
> > +#include "xe_gt_stats.h"
> > +#include "xe_hw_engine.h"
> >  #include "xe_pagefault.h"
> >  #include "xe_pagefault_types.h"
> > +#include "xe_svm.h"
> > +#include "xe_trace_bo.h"
> > +#include "xe_vm.h"
> >  
> >  /**
> >   * DOC: Xe page faults
> > @@ -30,9 +38,239 @@ static int xe_pagefault_entry_size(void)
> >         return roundup_pow_of_two(sizeof(struct xe_pagefault));
> >  }
> >  
> > +static int xe_pagefault_begin(struct drm_exec *exec, struct xe_vma
> > *vma,
> > +                             bool atomic, unsigned int id)
> > +{
> > +       struct xe_bo *bo = xe_vma_bo(vma);
> > +       struct xe_vm *vm = xe_vma_vm(vma);
> > +       int err;
> > +
> > +       err = xe_vm_lock_vma(exec, vma);
> > +       if (err)
> > +               return err;
> > +
> > +       if (atomic && IS_DGFX(vm->xe)) {
> > +               if (xe_vma_is_userptr(vma)) {
> > +                       err = -EACCES;
> > +                       return err;
> > +               }
> > +
> > +               /* Migrate to VRAM, move should invalidate the VMA
> > first */
> > +               err = xe_bo_migrate(bo, XE_PL_VRAM0 + id);
> > +               if (err)
> > +                       return err;
> > +       } else if (bo) {
> > +               /* Create backing store if needed */
> > +               err = xe_bo_validate(bo, vm, true);
> > +               if (err)
> > +                       return err;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma
> > *vma,
> > +                                  bool atomic)
> > +{
> > +       struct xe_vm *vm = xe_vma_vm(vma);
> > +       struct xe_tile *tile = gt_to_tile(gt);
> > +       struct drm_exec exec;
> > +       struct dma_fence *fence;
> > +       ktime_t end = 0;
> > +       int err;
> > +
> > +       lockdep_assert_held_write(&vm->lock);
> > +
> > +       xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_COUNT, 1);
> > +       xe_gt_stats_incr(gt, XE_GT_STATS_ID_VMA_PAGEFAULT_KB,
> > +                        xe_vma_size(vma) / SZ_1K);
> > +
> > +       trace_xe_vma_pagefault(vma);
> > +
> > +       /* Check if VMA is valid, opportunistic check only */
> > +       if (xe_vm_has_valid_gpu_mapping(tile, vma->tile_present,
> > +                                       vma->tile_invalidated) &&
> > !atomic)
> > +               return 0;
> > +
> > +retry_userptr:
> > +       if (xe_vma_is_userptr(vma) &&
> > +           xe_vma_userptr_check_repin(to_userptr_vma(vma))) {
> > +               struct xe_userptr_vma *uvma = to_userptr_vma(vma);
> > +
> > +               err = xe_vma_userptr_pin_pages(uvma);
> > +               if (err)
> > +                       return err;
> > +       }
> > +
> > +       /* Lock VM and BOs dma-resv */
> > +       drm_exec_init(&exec, 0, 0);
> > +       drm_exec_until_all_locked(&exec) {
> > +               err = xe_pagefault_begin(&exec, vma, atomic, tile-
> > >id);
> > +               drm_exec_retry_on_contention(&exec);
> > +               if (xe_vm_validate_should_retry(&exec, err, &end))
> > +                       err = -EAGAIN;
> > +               if (err)
> > +                       goto unlock_dma_resv;
> > +
> > +               /* Bind VMA only to the GT that has faulted */
> > +               trace_xe_vma_pf_bind(vma);
> > +               fence = xe_vma_rebind(vm, vma, BIT(tile->id));
> > +               if (IS_ERR(fence)) {
> > +                       err = PTR_ERR(fence);
> > +                       if (xe_vm_validate_should_retry(&exec, err,
> > &end))
> > +                               err = -EAGAIN;
> > +                       goto unlock_dma_resv;
> > +               }
> > +       }
> > +
> > +       dma_fence_wait(fence, false);
> > +       dma_fence_put(fence);
> > +
> > +unlock_dma_resv:
> > +       drm_exec_fini(&exec);
> > +       if (err == -EAGAIN)
> > +               goto retry_userptr;
> > +
> > +       return err;
> > +}
> > +
> > +static bool
> > +xe_pagefault_access_is_atomic(enum xe_pagefault_access_type
> > access_type)
> > +{
> > +       return access_type == XE_PAGEFAULT_ACCESS_TYPE_ATOMIC;
> > +}
> > +
> > +static struct xe_vm *xe_pagefault_asid_to_vm(struct xe_device *xe,
> > u32 asid)
> > +{
> > +       struct xe_vm *vm;
> > +
> > +       down_read(&xe->usm.lock);
> > +       vm = xa_load(&xe->usm.asid_to_vm, asid);
> > +       if (vm && xe_vm_in_fault_mode(vm))
> > +               xe_vm_get(vm);
> > +       else
> > +               vm = ERR_PTR(-EINVAL);
> > +       up_read(&xe->usm.lock);
> > +
> > +       return vm;
> > +}
> > +
> > +static int xe_pagefault_service(struct xe_pagefault *pf)
> > +{
> > +       struct xe_gt *gt = pf->gt;
> > +       struct xe_device *xe = gt_to_xe(gt);
> > +       struct xe_vm *vm;
> > +       struct xe_vma *vma = NULL;
> > +       int err;
> > +       bool atomic;
> > +
> > +       /* Producer flagged this fault to be nacked */
> > +       if (pf->consumer.fault_level == XE_PAGEFAULT_LEVEL_NACK)
> > +               return -EFAULT;
> > +
> > +       vm = xe_pagefault_asid_to_vm(xe, pf->consumer.asid);
> > +       if (IS_ERR(vm))
> > +               return PTR_ERR(vm);
> > +
> > +       /*
> > +        * TODO: Change to read lock? Using write lock for
> > simplicity.
> > +        */
> > +       down_write(&vm->lock);
> > +
> > +       if (xe_vm_is_closed(vm)) {
> > +               err = -ENOENT;
> > +               goto unlock_vm;
> > +       }
> > +
> > +       vma = xe_vm_find_vma_by_addr(vm, pf->consumer.page_addr);
> > +       if (!vma) {
> > +               err = -EINVAL;
> > +               goto unlock_vm;
> > +       }
> > +
> > +       atomic = xe_pagefault_access_is_atomic(pf-
> > >consumer.access_type);
> > +
> > +       if (xe_vma_is_cpu_addr_mirror(vma))
> > +               err = xe_svm_handle_pagefault(vm, vma, gt,
> > +                                             pf->consumer.page_addr,
> > atomic);
> > +       else
> > +               err = xe_pagefault_handle_vma(gt, vma, atomic);
> > +
> > +unlock_vm:
> > +       if (!err)
> > +               vm->usm.last_fault_vma = vma;
> > +       up_write(&vm->lock);
> > +       xe_vm_put(vm);
> > +
> > +       return err;
> > +}
> > +
> > +static bool xe_pagefault_queue_pop(struct xe_pagefault_queue
> > *pf_queue,
> > +                                  struct xe_pagefault *pf)
> > +{
> > +       bool found_fault = false;
> > +
> > +       spin_lock_irq(&pf_queue->lock);
> > +       if (pf_queue->tail != pf_queue->head) {
> > +               memcpy(pf, pf_queue->data + pf_queue->tail,
> > sizeof(*pf));
> > +               pf_queue->tail = (pf_queue->tail +
> > xe_pagefault_entry_size()) %
> > +                       pf_queue->size;
> > +               found_fault = true;
> > +       }
> > +       spin_unlock_irq(&pf_queue->lock);
> > +
> > +       return found_fault;
> > +}
> > +
> > +static void xe_pagefault_print(struct xe_pagefault *pf)
> > +{
> > +       xe_gt_dbg(pf->gt, "\n\tASID: %d\n"
> > +                 "\tFaulted Address: 0x%08x%08x\n"
> > +                 "\tFaultType: %d\n"
> > +                 "\tAccessType: %d\n"
> > +                 "\tFaultLevel: %d\n"
> > +                 "\tEngineClass: %d %s\n",
> > +                 pf->consumer.asid,
> > +                 upper_32_bits(pf->consumer.page_addr),
> > +                 lower_32_bits(pf->consumer.page_addr),
> > +                 pf->consumer.fault_type,
> > +                 pf->consumer.access_type,
> > +                 pf->consumer.fault_level,
> > +                 pf->consumer.engine_class,
> > +                 xe_hw_engine_class_to_str(pf-
> > >consumer.engine_class));
> > +}
> > +
> >  static void xe_pagefault_queue_work(struct work_struct *w)
> >  {
> > -       /* TODO: Implement */
> > +       struct xe_pagefault_queue *pf_queue =
> > +               container_of(w, typeof(*pf_queue), worker);
> > +       struct xe_pagefault pf;
> > +       unsigned long threshold;
> > +
> > +#define USM_QUEUE_MAX_RUNTIME_MS      20
> > +       threshold = jiffies +
> > msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
> > +
> > +       while (xe_pagefault_queue_pop(pf_queue, &pf)) {
> > +               int err;
> > +
> > +               if (!pf.gt)     /* Fault squashed during reset */
> > +                       continue;
> > +
> > +               err = xe_pagefault_service(&pf);
> > +               if (err) {
> > +                       xe_pagefault_print(&pf);
> 
> I realize you're just copying over the existing functionality here.
> But since this code is already being touched, should we change this to
> an info and use dbg to just print all incoming faults?
> 

That spam you'd get would be enormous. Every section of
xe_exec_system_allocator triggers 100s, if not 1000s, of faults. We
already have ftrace points for faults if you really want that information.
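
For example (a sketch; this assumes the events are compiled in and live
under the "xe" trace system - the exact tracefs path can vary):

    cd /sys/kernel/tracing
    echo 1 > events/xe/xe_vma_pagefault/enable
    cat trace_pipe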

Matt

> Anyway, even if we do want to do this, it's not really applicable here
> since this is just copying as mentioned.
> 
> With the changes Francois suggested in place:
> Reviewed-by: Stuart Summers <stuart.summers@intel.com>
> 
> Thanks,
> Stuart
> 
> > +                       xe_gt_dbg(pf.gt, "Fault response:
> > Unsuccessful %pe\n",
> > +                                 ERR_PTR(err));
> > +               }
> > +
> > +               pf.producer.ops->ack_fault(&pf, err);
> > +
> > +               if (time_after(jiffies, threshold)) {
> > +                       queue_work(gt_to_xe(pf.gt)->usm.pf_wq, w);
> > +                       break;
> > +               }
> > +       }
> > +#undef USM_QUEUE_MAX_RUNTIME_MS
> >  }
> >  
> >  static int xe_pagefault_queue_init(struct xe_device *xe,
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 06/11] drm/xe: Add xe_guc_pagefault layer
  2025-08-28 22:11   ` Summers, Stuart
@ 2025-08-29  0:54     ` Matthew Brost
  0 siblings, 0 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-29  0:54 UTC (permalink / raw)
  To: Summers, Stuart
  Cc: intel-xe@lists.freedesktop.org, Mrozek, Michal,
	Ghimiray, Himal Prasad, thomas.hellstrom@linux.intel.com,
	Dugast, Francois

On Thu, Aug 28, 2025 at 04:11:49PM -0600, Summers, Stuart wrote:
> On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > Add xe_guc_pagefault layer (producer) which parses G2H fault messages
> > into struct xe_pagefault, forwards them to the page fault layer
> > (consumer) for servicing, and provides a vfunc to acknowledge faults
> > to the GuC upon completion. Replace the old (and incorrect) GT page
> > fault layer with this new layer throughout the driver.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/Makefile           |  2 +-
> >  drivers/gpu/drm/xe/xe_gt.c            |  6 --
> >  drivers/gpu/drm/xe/xe_guc_ct.c        |  6 +-
> >  drivers/gpu/drm/xe/xe_guc_pagefault.c | 94
> > +++++++++++++++++++++++++++
> >  drivers/gpu/drm/xe/xe_guc_pagefault.h | 13 ++++
> >  drivers/gpu/drm/xe/xe_svm.c           |  3 +-
> >  drivers/gpu/drm/xe/xe_vm.c            |  1 -
> >  7 files changed, 110 insertions(+), 15 deletions(-)
> >  create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.c
> >  create mode 100644 drivers/gpu/drm/xe/xe_guc_pagefault.h
> > 
> > diff --git a/drivers/gpu/drm/xe/Makefile
> > b/drivers/gpu/drm/xe/Makefile
> > index 6fbebafe79c9..c103c114b75c 100644
> > --- a/drivers/gpu/drm/xe/Makefile
> > +++ b/drivers/gpu/drm/xe/Makefile
> > @@ -58,7 +58,6 @@ xe-y += xe_bb.o \
> >         xe_gt_freq.o \
> >         xe_gt_idle.o \
> >         xe_gt_mcr.o \
> > -       xe_gt_pagefault.o \
> >         xe_gt_sysfs.o \
> >         xe_gt_throttle.o \
> >         xe_gt_tlb_invalidation.o \
> > @@ -75,6 +74,7 @@ xe-y += xe_bb.o \
> >         xe_guc_id_mgr.o \
> >         xe_guc_klv_helpers.o \
> >         xe_guc_log.o \
> > +       xe_guc_pagefault.o \
> >         xe_guc_pc.o \
> >         xe_guc_submit.o \
> >         xe_heci_gsc.o \
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index 5aa03f89a062..35c7ba7828a6 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -32,7 +32,6 @@
> >  #include "xe_gt_freq.h"
> >  #include "xe_gt_idle.h"
> >  #include "xe_gt_mcr.h"
> > -#include "xe_gt_pagefault.h"
> >  #include "xe_gt_printk.h"
> >  #include "xe_gt_sriov_pf.h"
> >  #include "xe_gt_sriov_vf.h"
> > @@ -634,10 +633,6 @@ int xe_gt_init(struct xe_gt *gt)
> >         if (err)
> >                 return err;
> >  
> > -       err = xe_gt_pagefault_init(gt);
> > -       if (err)
> > -               return err;
> > -
> >         err = xe_gt_idle_init(&gt->gtidle);
> >         if (err)
> >                 return err;
> > @@ -848,7 +843,6 @@ static int gt_reset(struct xe_gt *gt)
> >         xe_uc_gucrc_disable(&gt->uc);
> >         xe_uc_stop_prepare(&gt->uc);
> >         xe_pagefault_reset(gt_to_xe(gt), gt);
> > -       xe_gt_pagefault_reset(gt);
> >  
> >         xe_uc_stop(&gt->uc);
> >  
> > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c
> > b/drivers/gpu/drm/xe/xe_guc_ct.c
> > index 3f4e6a46ff16..67b5dd182207 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > @@ -21,7 +21,6 @@
> >  #include "xe_devcoredump.h"
> >  #include "xe_device.h"
> >  #include "xe_gt.h"
> > -#include "xe_gt_pagefault.h"
> >  #include "xe_gt_printk.h"
> >  #include "xe_gt_sriov_pf_control.h"
> >  #include "xe_gt_sriov_pf_monitor.h"
> > @@ -29,6 +28,7 @@
> >  #include "xe_gt_tlb_invalidation.h"
> >  #include "xe_guc.h"
> >  #include "xe_guc_log.h"
> > +#include "xe_guc_pagefault.h"
> >  #include "xe_guc_relay.h"
> >  #include "xe_guc_submit.h"
> >  #include "xe_map.h"
> > @@ -1419,10 +1419,6 @@ static int process_g2h_msg(struct xe_guc_ct
> > *ct, u32 *msg, u32 len)
> >                 ret = xe_guc_tlb_invalidation_done_handler(guc,
> > payload,
> >                                                            adj_len);
> >                 break;
> > -       case XE_GUC_ACTION_ACCESS_COUNTER_NOTIFY:
> > -               ret = xe_guc_access_counter_notify_handler(guc,
> > payload,
> > -                                                          adj_len);
> > -               break;
> >         case XE_GUC_ACTION_GUC2PF_RELAY_FROM_VF:
> >                 ret = xe_guc_relay_process_guc2pf(&guc->relay, hxg,
> > hxg_len);
> >                 break;
> > diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.c
> > b/drivers/gpu/drm/xe/xe_guc_pagefault.c
> > new file mode 100644
> > index 000000000000..0aa069d2a581
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_guc_pagefault.c
> > @@ -0,0 +1,94 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2025 Intel Corporation
> > + */
> > +
> > +#include "abi/guc_actions_abi.h"
> > +#include "xe_guc.h"
> > +#include "xe_guc_ct.h"
> > +#include "xe_guc_pagefault.h"
> > +#include "xe_pagefault.h"
> > +
> > +static void guc_ack_fault(struct xe_pagefault *pf, int err)
> > +{
> > +       u32 vfid = FIELD_GET(PFD_VFID, pf->producer.msg[2]);
> > +       u32 engine_instance = FIELD_GET(PFD_ENG_INSTANCE, pf-
> > >producer.msg[0]);
> > +       u32 engine_class = FIELD_GET(PFD_ENG_CLASS, pf-
> > >producer.msg[0]);
> > +       u32 pdata = FIELD_GET(PFD_PDATA_LO, pf->producer.msg[0]) |
> > +               (FIELD_GET(PFD_PDATA_HI, pf->producer.msg[1]) <<
> > +                PFD_PDATA_HI_SHIFT);
> > +       u32 action[] = {
> > +               XE_GUC_ACTION_PAGE_FAULT_RES_DESC,
> > +
> > +               FIELD_PREP(PFR_VALID, 1) |
> > +               FIELD_PREP(PFR_SUCCESS, !!err) |
> > +               FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
> > +               FIELD_PREP(PFR_DESC_TYPE, FAULT_RESPONSE_DESC) |
> > +               FIELD_PREP(PFR_ASID, pf->consumer.asid),
> > +
> > +               FIELD_PREP(PFR_VFID, vfid) |
> > +               FIELD_PREP(PFR_ENG_INSTANCE, engine_instance) |
> > +               FIELD_PREP(PFR_ENG_CLASS, engine_class) |
> > +               FIELD_PREP(PFR_PDATA, pdata),
> > +       };
> > +       struct xe_guc *guc = pf->producer.private;
> > +
> > +       xe_guc_ct_send(&guc->ct, action, ARRAY_SIZE(action), 0, 0);
> > +}
> > +
> > +static const struct xe_pagefault_ops guc_pagefault_ops = {
> > +       .ack_fault = guc_ack_fault,
> > +};
> > +
> > +/**
> > + * xe_guc_pagefault_handler() - G2H page fault handler
> > + * @guc: GuC object
> > + * @msg: G2H message
> > + * @len: Length of G2H message
> > + *
> > + * Parse GuC to host (G2H) message into a struct xe_pagefault and
> > forward onto
> > + * the Xe page fault layer.
> > + *
> > + * Return: 0 on success, errno on failure
> > + */
> > +int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len)
> > +{
> > +       struct xe_pagefault pf;
> > +       int i;
> > +
> > +#define GUC_PF_MSG_LEN_DW      \
> > +       (sizeof(struct xe_guc_pagefault_desc) / sizeof(u32))
> > +
> > +       BUILD_BUG_ON(GUC_PF_MSG_LEN_DW >
> > XE_PAGEFAULT_PRODUCER_MSG_LEN_DW);
> > +
> > +       if (len != GUC_PF_MSG_LEN_DW)
> > +               return -EPROTO;
> > +
> > +       pf.gt = guc_to_gt(guc);
> > +
> > +       /*
> > +        * XXX: These values happen to match the enum in
> > xe_pagefault_types.h.
> > +        * If that changes, we’ll need to remap them here.
> > +        */
> > +       pf.consumer.page_addr = (u64)(FIELD_GET(PFD_VIRTUAL_ADDR_HI,
> > msg[3])
> > +                                     << PFD_VIRTUAL_ADDR_HI_SHIFT) |
> > +               (FIELD_GET(PFD_VIRTUAL_ADDR_LO, msg[2]) <<
> > +                PFD_VIRTUAL_ADDR_LO_SHIFT);
> > +       pf.consumer.asid = FIELD_GET(PFD_ASID, msg[1]);
> > +       pf.consumer.access_type = FIELD_GET(PFD_ACCESS_TYPE,
> > msg[2]);;
> > +       pf.consumer.fault_type = FIELD_GET(PFD_FAULT_TYPE, msg[2]);
> > +       if (FIELD_GET(XE2_PFD_TRVA_FAULT, msg[0]))
> > +               pf.consumer.fault_level = XE_PAGEFAULT_LEVEL_NACK;
> 
> We have a comment in the current implementation that says "sw isn't
> expected to handle trtt faults". At a minimum it would be nice to keep
> that here.
> 
> But really it would be nice to have a little documentation here as to
> *why* we don't care about these types of faults. Should we print
> something if this shows up, at least for debug?
> 

I can add a comment, but as to why we don't expect these faults, I really
have no idea. I believe I copied this code from the i915 without any real
thought.
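
Something along these lines is probably the best we can do for now (a
sketch, reusing the wording of the comment in the old xe_gt_pagefault.c):

        /*
         * SW is not expected to handle TRTT faults, so flag the fault as
         * a nack and let the consumer reject it back to the GuC.
         */
        if (FIELD_GET(XE2_PFD_TRVA_FAULT, msg[0]))
                pf.consumer.fault_level = XE_PAGEFAULT_LEVEL_NACK;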

> > +       else
> > +               pf.consumer.fault_level = FIELD_GET(PFD_FAULT_LEVEL,
> > msg[0]);
> > +       pf.consumer.engine_class = FIELD_GET(PFD_ENG_CLASS, msg[0]);
> 
> Again I think we should log the instance here as well.
> 

Sure.
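
Something like this, as a sketch (it assumes we grow a new
consumer.engine_instance field, which doesn't exist in this series):

        pf.consumer.engine_instance = FIELD_GET(PFD_ENG_INSTANCE, msg[0]);

and then xe_pagefault_print() can dump it next to the engine class.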

Matt

> Thanks,
> Stuart
> 
> 
> > +
> > +       pf.producer.private = guc;
> > +       pf.producer.ops = &guc_pagefault_ops;
> > +       for (i = 0; i < GUC_PF_MSG_LEN_DW; ++i)
> > +               pf.producer.msg[i] = msg[i];
> > +
> > +#undef GUC_PF_MSG_LEN_DW
> > +
> > +       return xe_pagefault_handler(guc_to_xe(guc), &pf);
> > +}
> > diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.h
> > b/drivers/gpu/drm/xe/xe_guc_pagefault.h
> > new file mode 100644
> > index 000000000000..0723f57b8ea9
> > --- /dev/null
> > +++ b/drivers/gpu/drm/xe/xe_guc_pagefault.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2025 Intel Corporation
> > + */
> > +
> > +#ifndef _XE_GUC_PAGEFAULT_H_
> > +#define _XE_GUC_PAGEFAULT_H_
> > +
> > +#include <linux/types.h>
> > +
> > +int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len);
> > +
> > +#endif
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 10c8a1bcb86e..1bcf3ba3b350 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -109,8 +109,7 @@ xe_svm_garbage_collector_add_range(struct xe_vm
> > *vm, struct xe_svm_range *range,
> >                               &vm->svm.garbage_collector.range_list);
> >         spin_unlock(&vm->svm.garbage_collector.lock);
> >  
> > -       queue_work(xe_device_get_root_tile(xe)->primary_gt-
> > >usm.pf_wq,
> > -                  &vm->svm.garbage_collector.work);
> > +       queue_work(xe->usm.pf_wq, &vm->svm.garbage_collector.work);
> >  }
> >  
> >  static u8
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index 432ea325677d..c9ae13c32117 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -27,7 +27,6 @@
> >  #include "xe_device.h"
> >  #include "xe_drm_client.h"
> >  #include "xe_exec_queue.h"
> > -#include "xe_gt_pagefault.h"
> >  #include "xe_gt_tlb_invalidation.h"
> >  #include "xe_migrate.h"
> >  #include "xe_pat.h"
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 10/11] drm/xe: Thread prefetch of SVM ranges
  2025-08-28 22:55   ` Summers, Stuart
@ 2025-08-29  1:06     ` Matthew Brost
  0 siblings, 0 replies; 51+ messages in thread
From: Matthew Brost @ 2025-08-29  1:06 UTC (permalink / raw)
  To: Summers, Stuart
  Cc: intel-xe@lists.freedesktop.org, Mrozek, Michal,
	Ghimiray, Himal Prasad, thomas.hellstrom@linux.intel.com,
	Dugast, Francois

On Thu, Aug 28, 2025 at 04:55:20PM -0600, Summers, Stuart wrote:
> On Tue, 2025-08-05 at 23:22 -0700, Matthew Brost wrote:
> > The migrate_vma_* functions are very CPU-intensive; as a result,
> > prefetching SVM ranges is limited by CPU performance rather than
> > paging
> > copy engine bandwidth. To accelerate SVM range prefetching, the step
> > that calls migrate_vma_* is now threaded. Reuses the page fault work
> > queue for threading.
> > 
> > Running xe_exec_system_allocator --r prefetch-benchmark, which tests
> > 64MB prefetches, shows an increase from ~4.35 GB/s to 12.25 GB/s with
> > this patch on drm-tip. Enabling high SLPC further increases
> > throughput
> > to ~15.25 GB/s, and combining SLPC with ULLS raises it to ~16 GB/s.
> > Both
> > of these optimizations are upcoming.
> > 
> > v2:
> >  - Use dedicated prefetch workqueue
> >  - Pick dedicated prefetch thread count based on profiling
> >  - Skip threaded prefetch for only 1 range or if prefetching to SRAM
> >  - Fully tested
> > v3:
> >  - Use page fault work queue
> > 
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_pagefault.c |  30 ++++++-
> >  drivers/gpu/drm/xe/xe_svm.c       |  17 +++-
> >  drivers/gpu/drm/xe/xe_vm.c        | 144 +++++++++++++++++++++++-----
> > --
> >  3 files changed, 152 insertions(+), 39 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
> > b/drivers/gpu/drm/xe/xe_pagefault.c
> > index 95d2eb8566fb..f11c70ca6dd9 100644
> > --- a/drivers/gpu/drm/xe/xe_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> > @@ -177,7 +177,17 @@ static int xe_pagefault_service(struct
> > xe_pagefault *pf)
> >         if (IS_ERR(vm))
> >                 return PTR_ERR(vm);
> >  
> > -       down_read(&vm->lock);
> > +       /*
> > +        * We can't block threaded prefetches from completing.
> > down_read() can
> > +        * block on a pending down_write(), so without a trylock
> > here, we could
> > +        * deadlock, since the page fault workqueue is shared with
> > prefetches,
> > +        * prefetches flush work items onto the same workqueue, and a
> > +        * down_write() could be pending.
> > +        */
> > +       if (!down_read_trylock(&vm->lock)) {
> > +               err = -EAGAIN;
> > +               goto put_vm;
> > +       }
> >  
> >         if (xe_vm_is_closed(vm)) {
> >                 err = -ENOENT;
> > @@ -202,11 +212,23 @@ static int xe_pagefault_service(struct
> > xe_pagefault *pf)
> >         if (!err)
> >                 vm->usm.last_fault_vma = vma;
> >         up_read(&vm->lock);
> > +put_vm:
> >         xe_vm_put(vm);
> >  
> >         return err;
> >  }
> >  
> > +static void xe_pagefault_queue_retry(struct xe_pagefault_queue
> > *pf_queue,
> > +                                    struct xe_pagefault *pf)
> > +{
> > +       spin_lock_irq(&pf_queue->lock);
> > +       if (!pf_queue->tail)
> > +               pf_queue->tail = pf_queue->size -
> > xe_pagefault_entry_size();
> > +       else
> > +               pf_queue->tail -= xe_pagefault_entry_size();
> > +       spin_unlock_irq(&pf_queue->lock);
> > +}
> > +
> >  static bool xe_pagefault_queue_pop(struct xe_pagefault_queue
> > *pf_queue,
> >                                    struct xe_pagefault *pf)
> >  {
> > @@ -259,7 +281,11 @@ static void xe_pagefault_queue_work(struct
> > work_struct *w)
> >                         continue;
> >  
> >                 err = xe_pagefault_service(&pf);
> > -               if (err) {
> > +               if (err == -EAGAIN) {
> > +                       xe_pagefault_queue_retry(pf_queue, &pf);
> > +                       queue_work(gt_to_xe(pf.gt)->usm.pf_wq, w);
> > +                       break;
> > +               } else if (err) {
> >                         xe_pagefault_print(&pf);
> >                         xe_gt_dbg(pf.gt, "Fault response:
> > Unsuccessful %pe\n",
> >                                   ERR_PTR(err));
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 6e5d9ce7c76e..069ede2c7991 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -306,8 +306,19 @@ static void
> > xe_svm_garbage_collector_work_func(struct work_struct *w)
> >         struct xe_vm *vm = container_of(w, struct xe_vm,
> >                                         svm.garbage_collector.work);
> >  
> > -       guard(rwsem_read)(&vm->lock);
> > -       xe_svm_garbage_collector(vm);
> > +       /*
> > +        * We can't block threaded prefetches from completing.
> > down_read() can
> > +        * block on a pending down_write(), so without a trylock
> > here, we could
> > +        * deadlock, since the page fault workqueue is shared with
> > prefetches,
> > +        * prefetches flush work items onto the same workqueue, and a
> > +        * down_write() could be pending.
> > +        */
> > +       if (down_read_trylock(&vm->lock)) {
> > +               xe_svm_garbage_collector(vm);
> > +               up_read(&vm->lock);
> > +       } else {
> > +               queue_work(vm->xe->usm.pf_wq, &vm-
> > >svm.garbage_collector.work);
> > +       }
> >  }
> >  
> >  #if IS_ENABLED(CONFIG_DRM_XE_PAGEMAP)
> > @@ -1148,5 +1159,5 @@ int xe_devm_add(struct xe_tile *tile, struct
> > xe_vram_region *vr)
> >  void xe_svm_flush(struct xe_vm *vm)
> >  {
> >         if (xe_vm_in_fault_mode(vm))
> > -               flush_work(&vm->svm.garbage_collector.work);
> > +               __flush_workqueue(vm->xe->usm.pf_wq);
> >  }
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index 3211827ef6d7..147b900b1f0b 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -2962,57 +2962,132 @@ static int check_ufence(struct xe_vma *vma)
> >         return 0;
> >  }
> >  
> > -static int prefetch_ranges(struct xe_vm *vm, struct xe_vma_op *op)
> > +struct prefetch_thread {
> > +       struct work_struct work;
> > +       struct drm_gpusvm_ctx *ctx;
> > +       struct xe_vma *vma;
> > +       struct xe_svm_range *svm_range;
> > +       struct xe_tile *tile;
> > +       u32 region;
> > +       int err;
> > +};
> > +
> > +static void prefetch_thread_func(struct prefetch_thread *thread)
> >  {
> > -       bool devmem_possible = IS_DGFX(vm->xe) &&
> > IS_ENABLED(CONFIG_DRM_XE_PAGEMAP);
> > -       struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
> > +       struct xe_vma *vma = thread->vma;
> > +       struct xe_vm *vm = xe_vma_vm(vma);
> > +       struct xe_svm_range *svm_range = thread->svm_range;
> > +       u32 region = thread->region;
> > +       struct xe_tile *tile = thread->tile;
> >         int err = 0;
> >  
> > -       struct xe_svm_range *svm_range;
> > +       guard(mutex)(&svm_range->lock);
> > +
> > +       if (xe_svm_range_is_removed(svm_range)) {
> > +               thread->err = -ENODATA;
> > +               return;
> > +       }
> > +
> > +       if (!region) {
> > +               xe_svm_range_migrate_to_smem(vm, svm_range);
> > +       } else if (xe_svm_range_needs_migrate_to_vram(svm_range, vma,
> > region)) {
> > +               err = xe_svm_alloc_vram(tile, svm_range, thread-
> > >ctx);
> > +               if (err) {
> > +                       drm_dbg(&vm->xe->drm,
> > +                               "VRAM allocation failed, retry from
> > userspace, asid=%u, gpusvm=%p, errno=%pe\n",
> > +                               vm->usm.asid, &vm->svm.gpusvm,
> > ERR_PTR(err));
> > +                       thread->err = -ENODATA;
> > +                       return;
> > +               }
> > +               xe_svm_range_debug(svm_range, "PREFETCH - RANGE
> > MIGRATED TO VRAM");
> > +       }
> > +
> > +       err = xe_svm_range_get_pages(vm, svm_range, thread->ctx);
> > +       if (err) {
> > +               drm_dbg(&vm->xe->drm, "Get pages failed, asid=%u,
> > gpusvm=%p, errno=%pe\n",
> > +                       vm->usm.asid, &vm->svm.gpusvm, ERR_PTR(err));
> > +               if (err == -EOPNOTSUPP || err == -EFAULT || err == -
> > EPERM)
> > +                       err = -ENODATA;
> > +               thread->err = err;
> > +               return;
> > +       }
> > +
> > +       xe_svm_range_debug(svm_range, "PREFETCH - RANGE GET PAGES
> > DONE");
> > +}
> > +
> > +static void prefetch_work_func(struct work_struct *w)
> > +{
> > +       struct prefetch_thread *thread =
> > +               container_of(w, struct prefetch_thread, work);
> > +
> > +       prefetch_thread_func(thread);
> > +}
> > +
> > +static int prefetch_ranges(struct xe_vm *vm, struct xe_vma_ops
> > *vops,
> > +                          struct xe_vma_op *op)
> > +{
> > +       struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
> > +       u32 region = op->prefetch_range.region;
> >         struct drm_gpusvm_ctx ctx = {};
> > -       struct xe_tile *tile;
> > +       struct prefetch_thread stack_thread;
> > +       struct xe_svm_range *svm_range;
> > +       struct prefetch_thread *prefetches;
> > +       bool sram = region_to_mem_type[region] == XE_PL_TT;
> > +       struct xe_tile *tile = sram ? xe_device_get_root_tile(vm->xe)
> > :
> > +               &vm->xe->tiles[region_to_mem_type[region] -
> > XE_PL_VRAM0];
> >         unsigned long i;
> > -       u32 region;
> > +       bool devmem_possible = IS_DGFX(vm->xe) &&
> > +               IS_ENABLED(CONFIG_DRM_XE_PAGEMAP);
> > +       bool skip_threads = op->prefetch_range.ranges_count == 1 ||
> > sram ||
> 
> Starting to work through these... shouldn't we also allow the user to
> opportunistically skip this (cgroup/sysfs/etc)? I realize the
> microbenchmark shows some improvement, but some of the workloads might

Prefetch without this on tip is so slow, no one would ever use it.

> also be much heavier on the CPU side and we don't want to throttle
> that with the extra kernel threads if they aren't heavy on the fault
> side.
> 

The next patch adds this via a modparam. We could make it a bit more
official via sysfs or configfs eventually.

Also, once we land THP device pages, all we need is 2 threads on BMG, as
the CPU time of a 2M prefetch drops from ~350us to ~10us. This should
scale to a CPU <-> GPU bus 8x faster - by scale I mean we can hit peak
bandwidth on the bus. With THP device pages, most of what the prefetch
threads are doing is just sleeping waiting for the copy to complete, so
the CPU is free to do other things. Hopefully we land that in the 6.19
timeframe.
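
Rough numbers to back that up (assuming ~350us vs ~10us of CPU time per
2MiB migrate_vma_* call):

    1 thread @ ~350us per 2MiB  ->  ~6 GB/s of issue rate (CPU bound, today)
    1 thread @  ~10us per 2MiB  ->  ~200 GB/s of issue rate (THP device pages)

so a couple of threads can out-issue even a bus 8x faster than BMG's, and
they spend most of that time asleep waiting on the copy.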

Matt 

> Thanks,
> Stuart
> 
> > +               !(vops->flags & XE_VMA_OPS_FLAG_DOWNGRADE_LOCK);
> > +       struct prefetch_thread *thread = skip_threads ? &stack_thread
> > : NULL;
> > +       int err = 0, idx = 0;
> >  
> >         if (!xe_vma_is_cpu_addr_mirror(vma))
> >                 return 0;
> >  
> > -       region = op->prefetch_range.region;
> > +       if (!skip_threads) {
> > +               prefetches = kvmalloc_array(op-
> > >prefetch_range.ranges_count,
> > +                                           sizeof(*prefetches),
> > GFP_KERNEL);
> > +               if (!prefetches)
> > +                       return -ENOMEM;
> > +       }
> >  
> >         ctx.read_only = xe_vma_read_only(vma);
> >         ctx.devmem_possible = devmem_possible;
> >         ctx.check_pages_threshold = devmem_possible ? SZ_64K : 0;
> >  
> > -       /* TODO: Threading the migration */
> >         xa_for_each(&op->prefetch_range.range, i, svm_range) {
> > -               guard(mutex)(&svm_range->lock);
> > -
> > -               if (xe_svm_range_is_removed(svm_range))
> > -                       return -ENODATA;
> > -
> > -               if (!region)
> > -                       xe_svm_range_migrate_to_smem(vm, svm_range);
> > +               if (!skip_threads) {
> > +                       thread = prefetches + idx++;
> > +                       INIT_WORK(&thread->work, prefetch_work_func);
> > +               }
> >  
> > -               if (xe_svm_range_needs_migrate_to_vram(svm_range,
> > vma, region)) {
> > -                       tile = &vm->xe-
> > >tiles[region_to_mem_type[region] - XE_PL_VRAM0];
> > -                       err = xe_svm_alloc_vram(tile, svm_range,
> > &ctx);
> > -                       if (err) {
> > -                               drm_dbg(&vm->xe->drm, "VRAM
> > allocation failed, retry from userspace, asid=%u, gpusvm=%p,
> > errno=%pe\n",
> > -                                       vm->usm.asid, &vm-
> > >svm.gpusvm, ERR_PTR(err));
> > -                               return -ENODATA;
> > -                       }
> > -                       xe_svm_range_debug(svm_range, "PREFETCH -
> > RANGE MIGRATED TO VRAM");
> > +               thread->ctx = &ctx;
> > +               thread->vma = vma;
> > +               thread->svm_range = svm_range;
> > +               thread->tile = tile;
> > +               thread->region = region;
> > +               thread->err = 0;
> > +
> > +               if (skip_threads) {
> > +                       prefetch_thread_func(thread);
> > +                       if (thread->err)
> > +                               return thread->err;
> > +               } else {
> > +                       queue_work(vm->xe->usm.pf_wq, &thread->work);
> >                 }
> > +       }
> >  
> > -               err = xe_svm_range_get_pages(vm, svm_range, &ctx);
> > -               if (err) {
> > -                       drm_dbg(&vm->xe->drm, "Get pages failed,
> > asid=%u, gpusvm=%p, errno=%pe\n",
> > -                               vm->usm.asid, &vm->svm.gpusvm,
> > ERR_PTR(err));
> > -                       if (err == -EOPNOTSUPP || err == -EFAULT ||
> > err == -EPERM)
> > -                               err = -ENODATA;
> > -                       return err;
> > +       if (!skip_threads) {
> > +               for (i = 0; i < idx; ++i) {
> > +                       thread = prefetches + i;
> > +
> > +                       flush_work(&thread->work);
> > +                       if (thread->err && (!err || err == -ENODATA))
> > +                               err = thread->err;
> >                 }
> > -               xe_svm_range_debug(svm_range, "PREFETCH - RANGE GET
> > PAGES DONE");
> > +               kvfree(prefetches);
> >         }
> >  
> >         return err;
> > @@ -3079,7 +3154,8 @@ static int op_lock_and_prep(struct drm_exec
> > *exec, struct xe_vm *vm,
> >         return err;
> >  }
> >  
> > -static int vm_bind_ioctl_ops_prefetch_ranges(struct xe_vm *vm,
> > struct xe_vma_ops *vops)
> > +static int vm_bind_ioctl_ops_prefetch_ranges(struct xe_vm *vm,
> > +                                            struct xe_vma_ops *vops)
> >  {
> >         struct xe_vma_op *op;
> >         int err;
> > @@ -3089,7 +3165,7 @@ static int
> > vm_bind_ioctl_ops_prefetch_ranges(struct xe_vm *vm, struct xe_vma_ops
> >  
> >         list_for_each_entry(op, &vops->list, link) {
> >                 if (op->base.op  == DRM_GPUVA_OP_PREFETCH) {
> > -                       err = prefetch_ranges(vm, op);
> > +                       err = prefetch_ranges(vm, vops, op);
> >                         if (err)
> >                                 return err;
> >                 }
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2025-08-29  1:06 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-06  6:22 [PATCH 00/11] Pagefault refactor, fine grained fault locking, threaded prefetch Matthew Brost
2025-08-06  6:22 ` [PATCH 01/11] drm/xe: Stub out new pagefault layer Matthew Brost
2025-08-06 23:01   ` Summers, Stuart
2025-08-06 23:53     ` Matthew Brost
2025-08-07 17:20       ` Summers, Stuart
2025-08-07 18:10         ` Matthew Brost
2025-08-28 20:18           ` Summers, Stuart
2025-08-28 20:20             ` Matthew Brost
2025-08-27 15:29   ` Francois Dugast
2025-08-27 16:03     ` Matthew Brost
2025-08-27 16:25       ` Francois Dugast
2025-08-27 16:40         ` Matthew Brost
2025-08-27 18:00       ` Matthew Brost
2025-08-28 20:08   ` Summers, Stuart
2025-08-06  6:22 ` [PATCH 02/11] drm/xe: Implement xe_pagefault_init Matthew Brost
2025-08-06 23:08   ` Summers, Stuart
2025-08-06 23:59     ` Matthew Brost
2025-08-07 18:22       ` Summers, Stuart
2025-08-27 16:30   ` Francois Dugast
2025-08-27 16:49     ` Matthew Brost
2025-08-28 20:10   ` Summers, Stuart
2025-08-28 20:14     ` Matthew Brost
2025-08-28 20:19       ` Summers, Stuart
2025-08-06  6:22 ` [PATCH 03/11] drm/xe: Implement xe_pagefault_reset Matthew Brost
2025-08-06 23:16   ` Summers, Stuart
2025-08-07  0:12     ` Matthew Brost
2025-08-07 18:29       ` Summers, Stuart
2025-08-06  6:22 ` [PATCH 04/11] drm/xe: Implement xe_pagefault_handler Matthew Brost
2025-08-28 11:26   ` Francois Dugast
2025-08-28 20:24   ` Summers, Stuart
2025-08-06  6:22 ` [PATCH 05/11] drm/xe: Implement xe_pagefault_queue_work Matthew Brost
2025-08-28 12:29   ` Francois Dugast
2025-08-28 18:39     ` Matthew Brost
2025-08-28 22:04   ` Summers, Stuart
2025-08-29  0:51     ` Matthew Brost
2025-08-06  6:22 ` [PATCH 06/11] drm/xe: Add xe_guc_pagefault layer Matthew Brost
2025-08-28 13:27   ` Francois Dugast
2025-08-28 18:38     ` Matthew Brost
2025-08-28 22:11   ` Summers, Stuart
2025-08-29  0:54     ` Matthew Brost
2025-08-06  6:22 ` [PATCH 07/11] drm/xe: Remove unused GT page fault code Matthew Brost
2025-08-28 19:13   ` Summers, Stuart
2025-08-06  6:22 ` [PATCH 08/11] drm/xe: Fine grained page fault locking Matthew Brost
2025-08-06  6:22 ` [PATCH 09/11] drm/xe: Allow prefetch-only VM bind IOCTLs to use VM read lock Matthew Brost
2025-08-06  6:22 ` [PATCH 10/11] drm/xe: Thread prefetch of SVM ranges Matthew Brost
2025-08-28 22:55   ` Summers, Stuart
2025-08-29  1:06     ` Matthew Brost
2025-08-06  6:22 ` [PATCH 11/11] drm/xe: Add num_pf_queue modparam Matthew Brost
2025-08-28 22:58   ` Summers, Stuart
2025-08-06  6:36 ` ✗ CI.checkpatch: warning for Pagefault refactor, fine grained fault locking, threaded prefetch Patchwork
2025-08-06  6:36 ` ✗ CI.KUnit: failure " Patchwork
