* [RFC PATCH 0/6] Multi-pass MMU interval notifiers
@ 2025-08-09 13:51 Thomas Hellström
  2025-08-09 13:51 ` [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes Thomas Hellström
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Thomas Hellström @ 2025-08-09 13:51 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Matthew Brost, Jason Gunthorpe,
	Andrew Morton, Simona Vetter, Dave Airlie, dri-devel, linux-mm,
	linux-kernel, Christian König

GPU use-cases for mmu_interval_notifiers with HMM often involve
starting a GPU operation and then waiting for it to complete.
These operations are typically context preemption or TLB flushing.

With single-pass notifiers per GPU this doesn't scale in
multi-GPU scenarios. In those scenarios we'd want to first start
preemption or TLB flushing on all GPUs and as a second pass wait
for them to complete on all GPUs.

One can do this on a per-driver basis by multiplexing per-driver
notifiers, but that would mean sharing the notifier "user" lock
across all GPUs, and that doesn't scale well either, so adding
support for multi-pass in the core appears to be the right choice.

So this series does that, with patch 1 implementing the core support
and also describing the choices made.
The rest of the patches implement a POC with drm_gpusvm, but this
will also come in handy for things like userptr, where waiting for
bind completion, starting of preemption and waiting for
preemption completion can be pipelined across GPUs.
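
To illustrate the intended flow, a driver-side callback pair could look
roughly like the sketch below. The my_* names are made up for the
example; the real interface is the one added in patch 1, and the actual
driver-side usage is in the drm_gpusvm / Xe patches later in the series.

/* Illustrative sketch only; my_* identifiers are placeholders. */
struct my_inval_pass {
        struct my_gpu_fence fences[MY_NUM_GPUS]; /* started in the first pass */
        struct mmu_interval_notifier_pass base;
};

static struct mmu_interval_notifier_pass *
my_inval_second(struct mmu_interval_notifier_pass *p,
                const struct mmu_notifier_range *range,
                unsigned long cur_seq)
{
        struct my_inval_pass *pass = container_of(p, typeof(*pass), base);
        int i;

        /* Second pass: invalidations were started on all GPUs, now wait. */
        for (i = 0; i < MY_NUM_GPUS; i++)
                my_gpu_fence_wait(&pass->fences[i]);

        kfree(pass);
        return NULL;    /* done, no further pass */
}

static bool my_invalidate_multipass(struct mmu_interval_notifier *mni,
                                    const struct mmu_notifier_range *range,
                                    unsigned long cur_seq,
                                    struct mmu_interval_notifier_pass **pass)
{
        struct my_inval_pass *p = kzalloc(sizeof(*p), GFP_NOWAIT);

        mmu_interval_set_seq(mni, cur_seq); /* locking elided in this sketch */

        if (!p) {
                /* GFP_NOWAIT failed: fall back to single-pass operation. */
                my_gpu_invalidate_and_wait(mni, range);
                return true;
        }

        /* First pass: kick off the invalidation, don't wait here. */
        my_gpu_start_invalidation(mni, range, p->fences);
        p->base.pass = my_inval_second;
        *pass = &p->base;       /* core runs the second pass later */

        return true;
}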

Any feedback or suggestions for alternative approaches are appreciated.

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: <dri-devel@lists.freedesktop.org>
Cc: <linux-mm@kvack.org>
Cc: <linux-kernel@vger.kernel.org>

Matthew Brost (5):
  drm/gpusvm: Update GPU SVM / Xe to twopass MMU notifier
  drm/gpusvm: Add drm_gpusvm_in_notifier_* helpers
  drm/xe: Skip waiting on unarmed fences in
    xe_gt_tlb_invalidation_fence_wait
  drm/xe: Add fences argument to xe_vm_range_tilemask_tlb_invalidation
  drm/xe: Implement two pass MMU notifiers for SVM

Thomas Hellström (1):
  mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes

 drivers/gpu/drm/drm_gpusvm.c                | 18 +++--
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |  3 +-
 drivers/gpu/drm/xe/xe_svm.c                 | 84 +++++++++++++++++----
 drivers/gpu/drm/xe/xe_vm.c                  | 26 ++++---
 drivers/gpu/drm/xe/xe_vm.h                  |  6 +-
 include/drm/drm_gpusvm.h                    | 33 ++++++--
 include/linux/mmu_notifier.h                | 30 ++++++++
 mm/mmu_notifier.c                           | 67 +++++++++++++---
 8 files changed, 217 insertions(+), 50 deletions(-)

-- 
2.50.1



* [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-09 13:51 [RFC PATCH 0/6] Multi-pass MMU interval notifiers Thomas Hellström
@ 2025-08-09 13:51 ` Thomas Hellström
  2025-08-18 16:07   ` Jason Gunthorpe
  2025-08-19 10:03   ` Alistair Popple
  2025-08-09 13:51 ` [RFC PATCH 2/6] drm/gpusvm: Update GPU SVM / Xe to twopass MMU notifier Thomas Hellström
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Thomas Hellström @ 2025-08-09 13:51 UTC (permalink / raw)
  To: intel-xe
  Cc: Thomas Hellström, Jason Gunthorpe, Andrew Morton,
	Simona Vetter, Dave Airlie, dri-devel, linux-mm, linux-kernel,
	Matthew Brost, Christian König

GPU use-cases for mmu_interval_notifiers with HMM often involve
starting a GPU operation and then waiting for it to complete.
These operations are typically context preemption or TLB flushing.

With single-pass notifiers per GPU this doesn't scale in
multi-GPU scenarios. In those scenarios we'd want to first start
preemption or TLB flushing on all GPUs and as a second pass wait
for them to complete on all GPUs.

One can do this on a per-driver basis by multiplexing per-driver
notifiers, but that would mean sharing the notifier "user" lock
across all GPUs, and that doesn't scale well either, so adding
support for multi-pass in the core appears to be the right choice.

Implement multi-pass capability in the mmu_interval_notifier. Use a
linked list for the additional passes to minimize the impact on
use-cases that don't need the multi-pass functionality.

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: <dri-devel@lists.freedesktop.org>
Cc: <linux-mm@kvack.org>
Cc: <linux-kernel@vger.kernel.org>

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 include/linux/mmu_notifier.h | 30 ++++++++++++++++
 mm/mmu_notifier.c            | 67 +++++++++++++++++++++++++++++++-----
 2 files changed, 88 insertions(+), 9 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d1094c2d5fb6..1107a8eafd8a 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -233,6 +233,32 @@ struct mmu_notifier {
 	unsigned int users;
 };
 
+/**
+ * struct mmu_interval_notifier_pass - mmu_interval_notifier multi-pass abstraction
+ * @link: List link for the notifiers pending pass list
+ *
+ * Allocate, typically using GFP_NOWAIT in the interval notifier's first pass.
+ * If allocation fails (which is not unlikely under memory pressure), fall back
+ * to single-pass operation.
+ */
+struct mmu_interval_notifier_pass {
+	struct list_head link;
+	/**
+	 * @pass: Driver callback for an additional pass.
+	 * @additional_pass: Pointer to the mmu_interval_notifier_pass structure.
+	 * @range: The mmu_notifier_range.
+	 * @cur_seq: The current sequence set by the first pass.
+	 *
+	 * Return: Either a pointer to a valid mmu_interval_notifier_pass for
+	 * another pass to be called, or %NULL if processing is complete for this
+	 * notifier. There is no error reporting mechanism for additional passes.
+	 */
+	struct mmu_interval_notifier_pass *
+	(*pass) (struct mmu_interval_notifier_pass *additional_pass,
+		 const struct mmu_notifier_range *range,
+		 unsigned long cur_seq);
+};
+
 /**
  * struct mmu_interval_notifier_ops
  * @invalidate: Upon return the caller must stop using any SPTEs within this
@@ -243,6 +269,10 @@ struct mmu_interval_notifier_ops {
 	bool (*invalidate)(struct mmu_interval_notifier *interval_sub,
 			   const struct mmu_notifier_range *range,
 			   unsigned long cur_seq);
+	bool (*invalidate_multipass)(struct mmu_interval_notifier *interval_sub,
+				     const struct mmu_notifier_range *range,
+				     unsigned long cur_seq,
+				     struct mmu_interval_notifier_pass **pass);
 };
 
 struct mmu_interval_notifier {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 8e0125dc0522..dd6af87db103 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -260,6 +260,22 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
 }
 EXPORT_SYMBOL_GPL(mmu_interval_read_begin);
 
+static void mn_itree_additional_passes(struct list_head *additional_passes,
+				       const struct mmu_notifier_range *range,
+				       unsigned long cur_seq)
+{
+	struct mmu_interval_notifier_pass *p, *next;
+
+	while (!list_empty(additional_passes)) {
+		list_for_each_entry_safe(p, next, additional_passes, link) {
+			list_del_init(&p->link);
+			p = p->pass(p, range, cur_seq);
+			if (p)
+				list_add_tail(&p->link, additional_passes);
+		}
+	}
+}
+
 static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
 			     struct mm_struct *mm)
 {
@@ -272,17 +288,32 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
 	};
 	struct mmu_interval_notifier *interval_sub;
 	unsigned long cur_seq;
+	LIST_HEAD(additional_passes);
 	bool ret;
 
 	for (interval_sub =
 		     mn_itree_inv_start_range(subscriptions, &range, &cur_seq);
 	     interval_sub;
 	     interval_sub = mn_itree_inv_next(interval_sub, &range)) {
-		ret = interval_sub->ops->invalidate(interval_sub, &range,
-						    cur_seq);
+		if (interval_sub->ops->invalidate_multipass) {
+			struct mmu_interval_notifier_pass *second = NULL;
+
+			ret = interval_sub->ops->invalidate_multipass(interval_sub,
+								      &range,
+								      cur_seq,
+								      &second);
+			if (ret && second)
+				list_add_tail(&second->link, &additional_passes);
+
+		} else {
+			ret = interval_sub->ops->invalidate(interval_sub,
+							    &range,
+							    cur_seq);
+		}
 		WARN_ON(!ret);
 	}
 
+	mn_itree_additional_passes(&additional_passes, &range, cur_seq);
 	mn_itree_inv_end(subscriptions);
 }
 
@@ -431,6 +462,8 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
 {
 	struct mmu_interval_notifier *interval_sub;
 	unsigned long cur_seq;
+	LIST_HEAD(additional_passes);
+	int err = 0;
 
 	for (interval_sub =
 		     mn_itree_inv_start_range(subscriptions, range, &cur_seq);
@@ -438,23 +471,39 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
 	     interval_sub = mn_itree_inv_next(interval_sub, range)) {
 		bool ret;
 
-		ret = interval_sub->ops->invalidate(interval_sub, range,
-						    cur_seq);
+		if (interval_sub->ops->invalidate_multipass) {
+			struct mmu_interval_notifier_pass *second = NULL;
+
+			ret = interval_sub->ops->invalidate_multipass(interval_sub,
+								      range,
+								      cur_seq,
+								      &second);
+			if (ret && second)
+				list_add_tail(&second->link, &additional_passes);
+
+		} else {
+			ret = interval_sub->ops->invalidate(interval_sub,
+							    range,
+							    cur_seq);
+		}
 		if (!ret) {
 			if (WARN_ON(mmu_notifier_range_blockable(range)))
 				continue;
-			goto out_would_block;
+			err = -EAGAIN;
+			break;
 		}
 	}
-	return 0;
 
-out_would_block:
+	mn_itree_additional_passes(&additional_passes, range, cur_seq);
+
 	/*
 	 * On -EAGAIN the non-blocking caller is not allowed to call
 	 * invalidate_range_end()
 	 */
-	mn_itree_inv_end(subscriptions);
-	return -EAGAIN;
+	if (err)
+		mn_itree_inv_end(subscriptions);
+
+	return err;
 }
 
 static int mn_hlist_invalidate_range_start(
-- 
2.50.1



* [RFC PATCH 2/6] drm/gpusvm: Update GPU SVM / Xe to twopass MMU notifier
  2025-08-09 13:51 [RFC PATCH 0/6] Multi-pass MMU interval notifiers Thomas Hellström
  2025-08-09 13:51 ` [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes Thomas Hellström
@ 2025-08-09 13:51 ` Thomas Hellström
  2025-08-09 13:51 ` [RFC PATCH 3/6] drm/gpusvm: Add drm_gpusvm_in_notifier_* helpers Thomas Hellström
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Thomas Hellström @ 2025-08-09 13:51 UTC (permalink / raw)
  To: intel-xe
  Cc: Matthew Brost, Christian König, dri-devel, Jason Gunthorpe,
	Andrew Morton, Simona Vetter, Dave Airlie, linux-mm, linux-kernel

From: Matthew Brost <matthew.brost@intel.com>

Update GPU SVM and Xe to use two-pass MMU notifiers, enabling pipelined
TLB invalidations across VMs or multiple devices.

The driver-side (Xe) second pass is not implemented yet; it is added
later in the series.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/drm_gpusvm.c | 18 +++++++++++-------
 drivers/gpu/drm/xe/xe_svm.c  |  9 +++++----
 include/drm/drm_gpusvm.h     | 11 +++++++----
 3 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/drm_gpusvm.c b/drivers/gpu/drm/drm_gpusvm.c
index 661306da6b2d..92dc7d2bd6cf 100644
--- a/drivers/gpu/drm/drm_gpusvm.c
+++ b/drivers/gpu/drm/drm_gpusvm.c
@@ -374,10 +374,13 @@ notifier_iter_first(struct rb_root_cached *root, unsigned long start,
 	     (notifier__) = (next__), (next__) = __drm_gpusvm_notifier_next(notifier__))
 
 /**
- * drm_gpusvm_notifier_invalidate() - Invalidate a GPU SVM notifier.
+ * drm_gpusvm_notifier_invalidate_twopass() - Invalidate a GPU SVM notifier,
+ * first pass.
+ *
  * @mni: Pointer to the mmu_interval_notifier structure.
  * @mmu_range: Pointer to the mmu_notifier_range structure.
  * @cur_seq: Current sequence number.
+ * @pass: Optional second pass of the MMU notifier, populated by the driver
  *
  * This function serves as a generic MMU notifier for GPU SVM. It sets the MMU
  * notifier sequence number and calls the driver invalidate vfunc under
@@ -386,9 +389,10 @@ notifier_iter_first(struct rb_root_cached *root, unsigned long start,
  * Return: true if the operation succeeds, false otherwise.
  */
 static bool
-drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
-			       const struct mmu_notifier_range *mmu_range,
-			       unsigned long cur_seq)
+drm_gpusvm_notifier_invalidate_twopass(struct mmu_interval_notifier *mni,
+				       const struct mmu_notifier_range *mmu_range,
+				       unsigned long cur_seq,
+				       struct mmu_interval_notifier_pass **pass)
 {
 	struct drm_gpusvm_notifier *notifier =
 		container_of(mni, typeof(*notifier), notifier);
@@ -399,7 +403,7 @@ drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
 
 	down_write(&gpusvm->notifier_lock);
 	mmu_interval_set_seq(mni, cur_seq);
-	gpusvm->ops->invalidate(gpusvm, notifier, mmu_range);
+	gpusvm->ops->invalidate_twopass(gpusvm, notifier, mmu_range, pass);
 	up_write(&gpusvm->notifier_lock);
 
 	return true;
@@ -409,7 +413,7 @@ drm_gpusvm_notifier_invalidate(struct mmu_interval_notifier *mni,
  * drm_gpusvm_notifier_ops - MMU interval notifier operations for GPU SVM
  */
 static const struct mmu_interval_notifier_ops drm_gpusvm_notifier_ops = {
-	.invalidate = drm_gpusvm_notifier_invalidate,
+	.invalidate_twopass = drm_gpusvm_notifier_invalidate_twopass,
 };
 
 /**
@@ -440,7 +444,7 @@ int drm_gpusvm_init(struct drm_gpusvm *gpusvm,
 		    const struct drm_gpusvm_ops *ops,
 		    const unsigned long *chunk_sizes, int num_chunks)
 {
-	if (!ops->invalidate || !num_chunks)
+	if (!ops->invalidate_twopass || !num_chunks)
 		return -EINVAL;
 
 	gpusvm->name = name;
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index e35c6d4def20..23c5b363261c 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -171,9 +171,10 @@ xe_svm_range_notifier_event_end(struct xe_vm *vm, struct drm_gpusvm_range *r,
 						   mmu_range);
 }
 
-static void xe_svm_invalidate(struct drm_gpusvm *gpusvm,
-			      struct drm_gpusvm_notifier *notifier,
-			      const struct mmu_notifier_range *mmu_range)
+static void xe_svm_invalidate_twopass(struct drm_gpusvm *gpusvm,
+				      struct drm_gpusvm_notifier *notifier,
+				      const struct mmu_notifier_range *mmu_range,
+				      struct mmu_interval_notifier_pass **p)
 {
 	struct xe_vm *vm = gpusvm_to_vm(gpusvm);
 	struct xe_device *xe = vm->xe;
@@ -553,7 +554,7 @@ static const struct drm_pagemap_devmem_ops dpagemap_devmem_ops = {
 static const struct drm_gpusvm_ops gpusvm_ops = {
 	.range_alloc = xe_svm_range_alloc,
 	.range_free = xe_svm_range_free,
-	.invalidate = xe_svm_invalidate,
+	.invalidate_twopass = xe_svm_invalidate_twopass,
 };
 
 static const unsigned long fault_chunk_sizes[] = {
diff --git a/include/drm/drm_gpusvm.h b/include/drm/drm_gpusvm.h
index 8d613e9b2690..8b5e159857fc 100644
--- a/include/drm/drm_gpusvm.h
+++ b/include/drm/drm_gpusvm.h
@@ -63,17 +63,20 @@ struct drm_gpusvm_ops {
 	void (*range_free)(struct drm_gpusvm_range *range);
 
 	/**
-	 * @invalidate: Invalidate GPU SVM notifier (required)
+	 * @invalidate_twopass: Invalidate first pass GPU SVM notifier (required)
 	 * @gpusvm: Pointer to the GPU SVM
 	 * @notifier: Pointer to the GPU SVM notifier
 	 * @mmu_range: Pointer to the mmu_notifier_range structure
+	 * @pass: Optional second pass of the MMU notifier, populated
+	 * driver side if a second pass is desired
 	 *
 	 * Invalidate the GPU page tables. It can safely walk the notifier range
 	 * RB tree/list in this function. Called while holding the notifier lock.
 	 */
-	void (*invalidate)(struct drm_gpusvm *gpusvm,
-			   struct drm_gpusvm_notifier *notifier,
-			   const struct mmu_notifier_range *mmu_range);
+	void (*invalidate_twopass)(struct drm_gpusvm *gpusvm,
+				   struct drm_gpusvm_notifier *notifier,
+				   const struct mmu_notifier_range *mmu_range,
+				   struct mmu_interval_notifier_pass **pass);
 };
 
 /**
-- 
2.50.1



* [RFC PATCH 3/6] drm/gpusvm: Add drm_gpusvm_in_notifier_* helpers
  2025-08-09 13:51 [RFC PATCH 0/6] Multi-pass MMU interval notifiers Thomas Hellström
  2025-08-09 13:51 ` [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes Thomas Hellström
  2025-08-09 13:51 ` [RFC PATCH 2/6] drm/gpusvm: Update GPU SVM / Xe to twopass MMU notifier Thomas Hellström
@ 2025-08-09 13:51 ` Thomas Hellström
  2025-08-09 13:51 ` [RFC PATCH 4/6] drm/xe: Skip waiting on unarmed fences in xe_gt_tlb_invalidation_fence_wait Thomas Hellström
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Thomas Hellström @ 2025-08-09 13:51 UTC (permalink / raw)
  To: intel-xe
  Cc: Matthew Brost, Christian König, dri-devel, Jason Gunthorpe,
	Andrew Morton, Simona Vetter, Dave Airlie, linux-mm, linux-kernel

From: Matthew Brost <matthew.brost@intel.com>

Abstract drm_gpusvm_in_notifier_lock/unlock with helpers. The intended
usage is a client-side second pass of an MMU notifier.
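
For illustration, in the Xe second pass added later in the series this
ends up looking roughly like (fragment from xe_svm_invalidate_second()):

        drm_gpusvm_in_notifier_lock(gpusvm);
        drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
                xe_svm_range_notifier_event_end(vm, r, mmu_range);
        drm_gpusvm_in_notifier_unlock(gpusvm);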

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 include/drm/drm_gpusvm.h | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/include/drm/drm_gpusvm.h b/include/drm/drm_gpusvm.h
index 8b5e159857fc..4bdbe10685cf 100644
--- a/include/drm/drm_gpusvm.h
+++ b/include/drm/drm_gpusvm.h
@@ -313,7 +313,7 @@ void drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
 #endif
 
 /**
- * drm_gpusvm_notifier_lock() - Lock GPU SVM notifier
+ * drm_gpusvm_notifier_lock() - Lock GPU SVM notifier, client side
  * @gpusvm__: Pointer to the GPU SVM structure.
  *
  * Abstract client usage GPU SVM notifier lock, take lock
@@ -322,7 +322,7 @@ void drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
 	down_read(&(gpusvm__)->notifier_lock)
 
 /**
- * drm_gpusvm_notifier_unlock() - Unlock GPU SVM notifier
+ * drm_gpusvm_notifier_unlock() - Unlock GPU SVM notifier, client side
  * @gpusvm__: Pointer to the GPU SVM structure.
  *
  * Abstract client usage GPU SVM notifier lock, drop lock
@@ -330,6 +330,24 @@ void drm_gpusvm_range_set_unmapped(struct drm_gpusvm_range *range,
 #define drm_gpusvm_notifier_unlock(gpusvm__)	\
 	up_read(&(gpusvm__)->notifier_lock)
 
+/**
+ * drm_gpusvm_in_notifier_lock() - Lock GPU SVM notifier, in notifier
+ * @gpusvm__: Pointer to the GPU SVM structure.
+ *
+ * Abstract in notifier (2nd pass) usage GPU SVM notifier lock, take lock
+ */
+#define drm_gpusvm_in_notifier_lock(gpusvm__)	\
+	down_write(&(gpusvm__)->notifier_lock)
+
+/**
+ * drm_gpusvm_in_notifier_unlock() - Unlock GPU SVM notifier, in notifier
+ * @gpusvm__: Pointer to the GPU SVM structure.
+ *
+ * Abstract in notifier (2nd pass) GPU SVM notifier lock, drop lock
+ */
+#define drm_gpusvm_in_notifier_unlock(gpusvm__)	\
+	up_write(&(gpusvm__)->notifier_lock)
+
 /**
  * drm_gpusvm_range_start() - GPU SVM range start address
  * @range: Pointer to the GPU SVM range
-- 
2.50.1



* [RFC PATCH 4/6] drm/xe: Skip waiting on unarmed fences in xe_gt_tlb_invalidation_fence_wait
  2025-08-09 13:51 [RFC PATCH 0/6] Multi-pass MMU interval notifiers Thomas Hellström
                   ` (2 preceding siblings ...)
  2025-08-09 13:51 ` [RFC PATCH 3/6] drm/gpusvm: Add drm_gpusvm_in_notifier_* helpers Thomas Hellström
@ 2025-08-09 13:51 ` Thomas Hellström
  2025-08-09 13:51 ` [RFC PATCH 5/6] drm/xe: Add fences argument to xe_vm_range_tilemask_tlb_invalidation Thomas Hellström
  2025-08-09 13:51 ` [RFC PATCH 6/6] drm/xe: Implement two pass MMU notifiers for SVM Thomas Hellström
  5 siblings, 0 replies; 22+ messages in thread
From: Thomas Hellström @ 2025-08-09 13:51 UTC (permalink / raw)
  To: intel-xe
  Cc: Matthew Brost, Christian König, dri-devel, Jason Gunthorpe,
	Andrew Morton, Simona Vetter, Dave Airlie, linux-mm, linux-kernel

From: Matthew Brost <matthew.brost@intel.com>

Avoids unnecessary waits when the TLB invalidation fence has not been
armed, simplifying caller logic in cases where the fence status is
uncertain.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
index f7f0f2eaf4b5..c6d4398d3429 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h
@@ -34,7 +34,8 @@ void xe_gt_tlb_invalidation_fence_signal(struct xe_gt_tlb_invalidation_fence *fe
 static inline void
 xe_gt_tlb_invalidation_fence_wait(struct xe_gt_tlb_invalidation_fence *fence)
 {
-	dma_fence_wait(&fence->base, false);
+	if (fence->seqno)
+		dma_fence_wait(&fence->base, false);
 }
 
 #endif	/* _XE_GT_TLB_INVALIDATION_ */
-- 
2.50.1



* [RFC PATCH 5/6] drm/xe: Add fences argument to xe_vm_range_tilemask_tlb_invalidation
  2025-08-09 13:51 [RFC PATCH 0/6] Multi-pass MMU interval notifiers Thomas Hellström
                   ` (3 preceding siblings ...)
  2025-08-09 13:51 ` [RFC PATCH 4/6] drm/xe: Skip waiting on unarmed fences in xe_gt_tlb_invalidation_fence_wait Thomas Hellström
@ 2025-08-09 13:51 ` Thomas Hellström
  2025-08-09 13:51 ` [RFC PATCH 6/6] drm/xe: Implement two pass MMU notifiers for SVM Thomas Hellström
  5 siblings, 0 replies; 22+ messages in thread
From: Thomas Hellström @ 2025-08-09 13:51 UTC (permalink / raw)
  To: intel-xe
  Cc: Matthew Brost, Christian König, dri-devel, Jason Gunthorpe,
	Andrew Morton, Simona Vetter, Dave Airlie, linux-mm, linux-kernel

From: Matthew Brost <matthew.brost@intel.com>

Introduce a fences argument to xe_vm_range_tilemask_tlb_invalidation,
allowing callers to provide fences and defer waiting to a later point.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/xe/xe_svm.c |  3 ++-
 drivers/gpu/drm/xe/xe_vm.c  | 26 +++++++++++++++++---------
 drivers/gpu/drm/xe/xe_vm.h  |  6 ++++--
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 23c5b363261c..82a598c8d56e 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -226,7 +226,8 @@ static void xe_svm_invalidate_twopass(struct drm_gpusvm *gpusvm,
 
 	xe_device_wmb(xe);
 
-	err = xe_vm_range_tilemask_tlb_invalidation(vm, adj_start, adj_end, tile_mask);
+	err = xe_vm_range_tilemask_tlb_invalidation(vm, NULL, adj_start,
+						    adj_end, tile_mask);
 	WARN_ON_ONCE(err);
 
 range_notifier_event_end:
diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
index 148a2425006f..52242fac6969 100644
--- a/drivers/gpu/drm/xe/xe_vm.c
+++ b/drivers/gpu/drm/xe/xe_vm.c
@@ -3846,6 +3846,7 @@ void xe_vm_unlock(struct xe_vm *vm)
  * xe_vm_range_tilemask_tlb_invalidation - Issue a TLB invalidation on this tilemask for an
  * address range
  * @vm: The VM
+ * @fences: Caller provided fences, caller owns waiting if non-NULL
  * @start: start address
  * @end: end address
  * @tile_mask: mask for which gt's issue tlb invalidation
@@ -3854,10 +3855,12 @@ void xe_vm_unlock(struct xe_vm *vm)
  *
  * Returns 0 for success, negative error code otherwise.
  */
-int xe_vm_range_tilemask_tlb_invalidation(struct xe_vm *vm, u64 start,
-					  u64 end, u8 tile_mask)
+int xe_vm_range_tilemask_tlb_invalidation(struct xe_vm *vm,
+					  struct xe_gt_tlb_invalidation_fence *fences,
+					  u64 start, u64 end, u8 tile_mask)
 {
 	struct xe_gt_tlb_invalidation_fence fence[XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE];
+	struct xe_gt_tlb_invalidation_fence *__fence = fences ?: fence;
 	struct xe_tile *tile;
 	u32 fence_id = 0;
 	u8 id;
@@ -3869,37 +3872,41 @@ int xe_vm_range_tilemask_tlb_invalidation(struct xe_vm *vm, u64 start,
 	for_each_tile(tile, vm->xe, id) {
 		if (tile_mask & BIT(id)) {
 			xe_gt_tlb_invalidation_fence_init(tile->primary_gt,
-							  &fence[fence_id], true);
+							 __fence, true);
 
 			err = xe_gt_tlb_invalidation_range(tile->primary_gt,
-							   &fence[fence_id],
+							   __fence,
 							   start,
 							   end,
 							   vm->usm.asid);
 			if (err)
 				goto wait;
 			++fence_id;
+			++__fence;
 
 			if (!tile->media_gt)
 				continue;
 
 			xe_gt_tlb_invalidation_fence_init(tile->media_gt,
-							  &fence[fence_id], true);
+							  __fence, true);
 
 			err = xe_gt_tlb_invalidation_range(tile->media_gt,
-							   &fence[fence_id],
+							   __fence,
 							   start,
 							   end,
 							   vm->usm.asid);
 			if (err)
 				goto wait;
 			++fence_id;
+			++__fence;
 		}
 	}
 
 wait:
-	for (id = 0; id < fence_id; ++id)
-		xe_gt_tlb_invalidation_fence_wait(&fence[id]);
+	if (!fences) {
+		for (id = 0; id < fence_id; ++id)
+			xe_gt_tlb_invalidation_fence_wait(&fence[id]);
+	}
 
 	return err;
 }
@@ -3958,7 +3965,8 @@ int xe_vm_invalidate_vma(struct xe_vma *vma)
 
 	xe_device_wmb(xe);
 
-	ret = xe_vm_range_tilemask_tlb_invalidation(xe_vma_vm(vma), xe_vma_start(vma),
+	ret = xe_vm_range_tilemask_tlb_invalidation(xe_vma_vm(vma), NULL,
+						    xe_vma_start(vma),
 						    xe_vma_end(vma), tile_mask);
 
 	/* WRITE_ONCE pairs with READ_ONCE in xe_vm_has_valid_gpu_mapping() */
diff --git a/drivers/gpu/drm/xe/xe_vm.h b/drivers/gpu/drm/xe/xe_vm.h
index 3475a118f666..d1c3c9aa8d03 100644
--- a/drivers/gpu/drm/xe/xe_vm.h
+++ b/drivers/gpu/drm/xe/xe_vm.h
@@ -22,6 +22,7 @@ struct dma_fence;
 
 struct xe_exec_queue;
 struct xe_file;
+struct xe_gt_tlb_invalidation_fence;
 struct xe_sync_entry;
 struct xe_svm_range;
 struct drm_exec;
@@ -228,8 +229,9 @@ struct dma_fence *xe_vm_range_rebind(struct xe_vm *vm,
 struct dma_fence *xe_vm_range_unbind(struct xe_vm *vm,
 				     struct xe_svm_range *range);
 
-int xe_vm_range_tilemask_tlb_invalidation(struct xe_vm *vm, u64 start,
-					  u64 end, u8 tile_mask);
+int xe_vm_range_tilemask_tlb_invalidation(struct xe_vm *vm,
+					  struct xe_gt_tlb_invalidation_fence *fences,
+					  u64 start, u64 end, u8 tile_mask);
 
 int xe_vm_invalidate_vma(struct xe_vma *vma);
 
-- 
2.50.1



* [RFC PATCH 6/6] drm/xe: Implement two pass MMU notifiers for SVM
  2025-08-09 13:51 [RFC PATCH 0/6] Multi-pass MMU interval notifiers Thomas Hellström
                   ` (4 preceding siblings ...)
  2025-08-09 13:51 ` [RFC PATCH 5/6] drm/xe: Add fences argument to xe_vm_range_tilemask_tlb_invalidation Thomas Hellström
@ 2025-08-09 13:51 ` Thomas Hellström
  2025-08-11 20:46   ` Matthew Brost
  5 siblings, 1 reply; 22+ messages in thread
From: Thomas Hellström @ 2025-08-09 13:51 UTC (permalink / raw)
  To: intel-xe
  Cc: Matthew Brost, Christian König, dri-devel, Jason Gunthorpe,
	Andrew Morton, Simona Vetter, Dave Airlie, linux-mm, linux-kernel

From: Matthew Brost <matthew.brost@intel.com>

Implement two-pass MMU notifiers for SVM, enabling multiple VMs or
devices with GPU mappings to pipeline costly TLB invalidations by
issuing them in the first pass and waiting for completion in the second.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/drm_gpusvm.c |  2 +-
 drivers/gpu/drm/xe/xe_svm.c  | 74 ++++++++++++++++++++++++++++++------
 2 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/drm_gpusvm.c b/drivers/gpu/drm/drm_gpusvm.c
index 92dc7d2bd6cf..f153df1bc862 100644
--- a/drivers/gpu/drm/drm_gpusvm.c
+++ b/drivers/gpu/drm/drm_gpusvm.c
@@ -413,7 +413,7 @@ drm_gpusvm_notifier_invalidate_twopass(struct mmu_interval_notifier *mni,
  * drm_gpusvm_notifier_ops - MMU interval notifier operations for GPU SVM
  */
 static const struct mmu_interval_notifier_ops drm_gpusvm_notifier_ops = {
-	.invalidate_twopass = drm_gpusvm_notifier_invalidate_twopass,
+	.invalidate_multipass = drm_gpusvm_notifier_invalidate_twopass,
 };
 
 /**
diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
index 82a598c8d56e..5728394806ca 100644
--- a/drivers/gpu/drm/xe/xe_svm.c
+++ b/drivers/gpu/drm/xe/xe_svm.c
@@ -144,15 +144,8 @@ xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct drm_gpusvm_range *r,
 	 * invalidations spanning multiple ranges.
 	 */
 	for_each_tile(tile, xe, id)
-		if (xe_pt_zap_ptes_range(tile, vm, range)) {
+		if (xe_pt_zap_ptes_range(tile, vm, range))
 			tile_mask |= BIT(id);
-			/*
-			 * WRITE_ONCE pairs with READ_ONCE in
-			 * xe_vm_has_valid_gpu_mapping()
-			 */
-			WRITE_ONCE(range->tile_invalidated,
-				   range->tile_invalidated | BIT(id));
-		}
 
 	return tile_mask;
 }
@@ -161,16 +154,60 @@ static void
 xe_svm_range_notifier_event_end(struct xe_vm *vm, struct drm_gpusvm_range *r,
 				const struct mmu_notifier_range *mmu_range)
 {
+	struct xe_svm_range *range = to_xe_range(r);
 	struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
 
 	xe_svm_assert_in_notifier(vm);
 
+	/*
+	 * WRITE_ONCE pairs with READ_ONCE in xe_vm_has_valid_gpu_mapping()
+	 */
+	WRITE_ONCE(range->tile_invalidated, range->tile_present);
+
 	drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
 	if (!xe_vm_is_closed(vm) && mmu_range->event == MMU_NOTIFY_UNMAP)
 		xe_svm_garbage_collector_add_range(vm, to_xe_range(r),
 						   mmu_range);
 }
 
+struct xe_svm_invalidate_pass {
+	struct drm_gpusvm *gpusvm;
+	struct drm_gpusvm_notifier *notifier;
+#define XE_SVM_INVALIDATE_FENCE_COUNT	\
+	(XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE)
+	struct xe_gt_tlb_invalidation_fence fences[XE_SVM_INVALIDATE_FENCE_COUNT];
+	struct mmu_interval_notifier_pass p;
+};
+
+static struct mmu_interval_notifier_pass *
+xe_svm_invalidate_second(struct mmu_interval_notifier_pass *p,
+			 const struct mmu_notifier_range *mmu_range,
+			 unsigned long cur_seq)
+{
+	struct xe_svm_invalidate_pass *pass = container_of(p, typeof(*pass), p);
+	struct drm_gpusvm *gpusvm = pass->gpusvm;
+	struct drm_gpusvm_notifier *notifier = pass->notifier;
+	struct drm_gpusvm_range *r = NULL;
+	struct xe_vm *vm = gpusvm_to_vm(gpusvm);
+	u64 adj_start = mmu_range->start, adj_end = mmu_range->end;
+	int id;
+
+	/* Adjust invalidation to notifier boundaries */
+	adj_start = max(drm_gpusvm_notifier_start(notifier), adj_start);
+	adj_end = min(drm_gpusvm_notifier_end(notifier), adj_end);
+
+	for (id = 0; id < XE_SVM_INVALIDATE_FENCE_COUNT; ++id)
+		xe_gt_tlb_invalidation_fence_wait(&pass->fences[id]);
+
+	drm_gpusvm_in_notifier_lock(gpusvm);
+	drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
+		xe_svm_range_notifier_event_end(vm, r, mmu_range);
+	drm_gpusvm_in_notifier_unlock(gpusvm);
+
+	kfree(pass);
+	return NULL;
+}
+
 static void xe_svm_invalidate_twopass(struct drm_gpusvm *gpusvm,
 				      struct drm_gpusvm_notifier *notifier,
 				      const struct mmu_notifier_range *mmu_range,
@@ -179,6 +216,8 @@ static void xe_svm_invalidate_twopass(struct drm_gpusvm *gpusvm,
 	struct xe_vm *vm = gpusvm_to_vm(gpusvm);
 	struct xe_device *xe = vm->xe;
 	struct drm_gpusvm_range *r, *first;
+	struct xe_svm_invalidate_pass *pass = NULL;
+	struct xe_gt_tlb_invalidation_fence *fences = NULL;
 	u64 adj_start = mmu_range->start, adj_end = mmu_range->end;
 	u8 tile_mask = 0;
 	long err;
@@ -226,14 +265,25 @@ static void xe_svm_invalidate_twopass(struct drm_gpusvm *gpusvm,
 
 	xe_device_wmb(xe);
 
-	err = xe_vm_range_tilemask_tlb_invalidation(vm, NULL, adj_start,
+	pass = kzalloc(sizeof(*pass), GFP_NOWAIT);
+	if (pass) {
+		pass->gpusvm = gpusvm;
+		pass->notifier = notifier;
+		pass->p.pass = xe_svm_invalidate_second;
+		fences = pass->fences;
+		*p = &pass->p;
+	}
+
+	err = xe_vm_range_tilemask_tlb_invalidation(vm, fences, adj_start,
 						    adj_end, tile_mask);
 	WARN_ON_ONCE(err);
 
 range_notifier_event_end:
-	r = first;
-	drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
-		xe_svm_range_notifier_event_end(vm, r, mmu_range);
+	if (!pass) {
+		r = first;
+		drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
+			xe_svm_range_notifier_event_end(vm, r, mmu_range);
+	}
 }
 
 static int __xe_svm_garbage_collector(struct xe_vm *vm,
-- 
2.50.1



* Re: [RFC PATCH 6/6] drm/xe: Implement two pass MMU notifiers for SVM
  2025-08-09 13:51 ` [RFC PATCH 6/6] drm/xe: Implement two pass MMU notifiers for SVM Thomas Hellström
@ 2025-08-11 20:46   ` Matthew Brost
  2025-08-12  9:06     ` Thomas Hellström
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Brost @ 2025-08-11 20:46 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
	Andrew Morton, Simona Vetter, Dave Airlie, linux-mm, linux-kernel

On Sat, Aug 09, 2025 at 03:51:37PM +0200, Thomas Hellström wrote:
> From: Matthew Brost <matthew.brost@intel.com>
> 
> Implement two-pass MMU notifiers for SVM, enabling multiple VMs or
> devices with GPU mappings to pipeline costly TLB invalidations by
> issuing them in the first pass and waiting for completion in the second.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/drm_gpusvm.c |  2 +-
>  drivers/gpu/drm/xe/xe_svm.c  | 74 ++++++++++++++++++++++++++++++------
>  2 files changed, 63 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_gpusvm.c b/drivers/gpu/drm/drm_gpusvm.c
> index 92dc7d2bd6cf..f153df1bc862 100644
> --- a/drivers/gpu/drm/drm_gpusvm.c
> +++ b/drivers/gpu/drm/drm_gpusvm.c
> @@ -413,7 +413,7 @@ drm_gpusvm_notifier_invalidate_twopass(struct mmu_interval_notifier *mni,
>   * drm_gpusvm_notifier_ops - MMU interval notifier operations for GPU SVM
>   */
>  static const struct mmu_interval_notifier_ops drm_gpusvm_notifier_ops = {
> -	.invalidate_twopass = drm_gpusvm_notifier_invalidate_twopass,
> +	.invalidate_multipass = drm_gpusvm_notifier_invalidate_twopass,

This should be in patch #2.

Matt

>  };
>  
>  /**
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 82a598c8d56e..5728394806ca 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -144,15 +144,8 @@ xe_svm_range_notifier_event_begin(struct xe_vm *vm, struct drm_gpusvm_range *r,
>  	 * invalidations spanning multiple ranges.
>  	 */
>  	for_each_tile(tile, xe, id)
> -		if (xe_pt_zap_ptes_range(tile, vm, range)) {
> +		if (xe_pt_zap_ptes_range(tile, vm, range))
>  			tile_mask |= BIT(id);
> -			/*
> -			 * WRITE_ONCE pairs with READ_ONCE in
> -			 * xe_vm_has_valid_gpu_mapping()
> -			 */
> -			WRITE_ONCE(range->tile_invalidated,
> -				   range->tile_invalidated | BIT(id));
> -		}
>  
>  	return tile_mask;
>  }
> @@ -161,16 +154,60 @@ static void
>  xe_svm_range_notifier_event_end(struct xe_vm *vm, struct drm_gpusvm_range *r,
>  				const struct mmu_notifier_range *mmu_range)
>  {
> +	struct xe_svm_range *range = to_xe_range(r);
>  	struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
>  
>  	xe_svm_assert_in_notifier(vm);
>  
> +	/*
> +	 * WRITE_ONCE pairs with READ_ONCE in xe_vm_has_valid_gpu_mapping()
> +	 */
> +	WRITE_ONCE(range->tile_invalidated, range->tile_present);
> +
>  	drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
>  	if (!xe_vm_is_closed(vm) && mmu_range->event == MMU_NOTIFY_UNMAP)
>  		xe_svm_garbage_collector_add_range(vm, to_xe_range(r),
>  						   mmu_range);
>  }
>  
> +struct xe_svm_invalidate_pass {
> +	struct drm_gpusvm *gpusvm;
> +	struct drm_gpusvm_notifier *notifier;
> +#define XE_SVM_INVALIDATE_FENCE_COUNT	\
> +	(XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE)
> +	struct xe_gt_tlb_invalidation_fence fences[XE_SVM_INVALIDATE_FENCE_COUNT];
> +	struct mmu_interval_notifier_pass p;
> +};
> +
> +static struct mmu_interval_notifier_pass *
> +xe_svm_invalidate_second(struct mmu_interval_notifier_pass *p,
> +			 const struct mmu_notifier_range *mmu_range,
> +			 unsigned long cur_seq)
> +{
> +	struct xe_svm_invalidate_pass *pass = container_of(p, typeof(*pass), p);
> +	struct drm_gpusvm *gpusvm = pass->gpusvm;
> +	struct drm_gpusvm_notifier *notifier = pass->notifier;
> +	struct drm_gpusvm_range *r = NULL;
> +	struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> +	u64 adj_start = mmu_range->start, adj_end = mmu_range->end;
> +	int id;
> +
> +	/* Adjust invalidation to notifier boundaries */
> +	adj_start = max(drm_gpusvm_notifier_start(notifier), adj_start);
> +	adj_end = min(drm_gpusvm_notifier_end(notifier), adj_end);
> +
> +	for (id = 0; id < XE_SVM_INVALIDATE_FENCE_COUNT; ++id)
> +		xe_gt_tlb_invalidation_fence_wait(&pass->fences[id]);
> +
> +	drm_gpusvm_in_notifier_lock(gpusvm);
> +	drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
> +		xe_svm_range_notifier_event_end(vm, r, mmu_range);
> +	drm_gpusvm_in_notifier_unlock(gpusvm);
> +
> +	kfree(pass);
> +	return NULL;
> +}
> +
>  static void xe_svm_invalidate_twopass(struct drm_gpusvm *gpusvm,
>  				      struct drm_gpusvm_notifier *notifier,
>  				      const struct mmu_notifier_range *mmu_range,
> @@ -179,6 +216,8 @@ static void xe_svm_invalidate_twopass(struct drm_gpusvm *gpusvm,
>  	struct xe_vm *vm = gpusvm_to_vm(gpusvm);
>  	struct xe_device *xe = vm->xe;
>  	struct drm_gpusvm_range *r, *first;
> +	struct xe_svm_invalidate_pass *pass = NULL;
> +	struct xe_gt_tlb_invalidation_fence *fences = NULL;
>  	u64 adj_start = mmu_range->start, adj_end = mmu_range->end;
>  	u8 tile_mask = 0;
>  	long err;
> @@ -226,14 +265,25 @@ static void xe_svm_invalidate_twopass(struct drm_gpusvm *gpusvm,
>  
>  	xe_device_wmb(xe);
>  
> -	err = xe_vm_range_tilemask_tlb_invalidation(vm, NULL, adj_start,
> +	pass = kzalloc(sizeof(*pass), GFP_NOWAIT);
> +	if (pass) {
> +		pass->gpusvm = gpusvm;
> +		pass->notifier = notifier;
> +		pass->p.pass = xe_svm_invalidate_second;
> +		fences = pass->fences;
> +		*p = &pass->p;
> +	}
> +
> +	err = xe_vm_range_tilemask_tlb_invalidation(vm, fences, adj_start,
>  						    adj_end, tile_mask);
>  	WARN_ON_ONCE(err);
>  
>  range_notifier_event_end:
> -	r = first;
> -	drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
> -		xe_svm_range_notifier_event_end(vm, r, mmu_range);
> +	if (!pass) {
> +		r = first;
> +		drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
> +			xe_svm_range_notifier_event_end(vm, r, mmu_range);
> +	}
>  }
>  
>  static int __xe_svm_garbage_collector(struct xe_vm *vm,
> -- 
> 2.50.1
> 


* Re: [RFC PATCH 6/6] drm/xe: Implement two pass MMU notifiers for SVM
  2025-08-11 20:46   ` Matthew Brost
@ 2025-08-12  9:06     ` Thomas Hellström
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Hellström @ 2025-08-12  9:06 UTC (permalink / raw)
  To: Matthew Brost
  Cc: intel-xe, Christian König, dri-devel, Jason Gunthorpe,
	Andrew Morton, Simona Vetter, Dave Airlie, linux-mm, linux-kernel

On Mon, 2025-08-11 at 13:46 -0700, Matthew Brost wrote:
> On Sat, Aug 09, 2025 at 03:51:37PM +0200, Thomas Hellström wrote:
> > From: Matthew Brost <matthew.brost@intel.com>
> > 
> > Implement two-pass MMU notifiers for SVM, enabling multiple VMs or
> > devices with GPU mappings to pipeline costly TLB invalidations by
> > issuing them in the first pass and waiting for completion in the
> > second.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/drm_gpusvm.c |  2 +-
> >  drivers/gpu/drm/xe/xe_svm.c  | 74 ++++++++++++++++++++++++++++++--
> > ----
> >  2 files changed, 63 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/drm_gpusvm.c
> > b/drivers/gpu/drm/drm_gpusvm.c
> > index 92dc7d2bd6cf..f153df1bc862 100644
> > --- a/drivers/gpu/drm/drm_gpusvm.c
> > +++ b/drivers/gpu/drm/drm_gpusvm.c
> > @@ -413,7 +413,7 @@ drm_gpusvm_notifier_invalidate_twopass(struct
> > mmu_interval_notifier *mni,
> >   * drm_gpusvm_notifier_ops - MMU interval notifier operations for
> > GPU SVM
> >   */
> >  static const struct mmu_interval_notifier_ops
> > drm_gpusvm_notifier_ops = {
> > -	.invalidate_twopass =
> > drm_gpusvm_notifier_invalidate_twopass,
> > +	.invalidate_multipass =
> > drm_gpusvm_notifier_invalidate_twopass,
> 
> This should be in patch #2.

Yup. My bad fixing up for the interface change in patch 1. Sorry for
that.
/Thomas


> 
> Matt
> 
> >  };
> >  
> >  /**
> > diff --git a/drivers/gpu/drm/xe/xe_svm.c
> > b/drivers/gpu/drm/xe/xe_svm.c
> > index 82a598c8d56e..5728394806ca 100644
> > --- a/drivers/gpu/drm/xe/xe_svm.c
> > +++ b/drivers/gpu/drm/xe/xe_svm.c
> > @@ -144,15 +144,8 @@ xe_svm_range_notifier_event_begin(struct xe_vm
> > *vm, struct drm_gpusvm_range *r,
> >  	 * invalidations spanning multiple ranges.
> >  	 */
> >  	for_each_tile(tile, xe, id)
> > -		if (xe_pt_zap_ptes_range(tile, vm, range)) {
> > +		if (xe_pt_zap_ptes_range(tile, vm, range))
> >  			tile_mask |= BIT(id);
> > -			/*
> > -			 * WRITE_ONCE pairs with READ_ONCE in
> > -			 * xe_vm_has_valid_gpu_mapping()
> > -			 */
> > -			WRITE_ONCE(range->tile_invalidated,
> > -				   range->tile_invalidated |
> > BIT(id));
> > -		}
> >  
> >  	return tile_mask;
> >  }
> > @@ -161,16 +154,60 @@ static void
> >  xe_svm_range_notifier_event_end(struct xe_vm *vm, struct
> > drm_gpusvm_range *r,
> >  				const struct mmu_notifier_range
> > *mmu_range)
> >  {
> > +	struct xe_svm_range *range = to_xe_range(r);
> >  	struct drm_gpusvm_ctx ctx = { .in_notifier = true, };
> >  
> >  	xe_svm_assert_in_notifier(vm);
> >  
> > +	/*
> > +	 * WRITE_ONCE pairs with READ_ONCE in
> > xe_vm_has_valid_gpu_mapping()
> > +	 */
> > +	WRITE_ONCE(range->tile_invalidated, range->tile_present);
> > +
> >  	drm_gpusvm_range_unmap_pages(&vm->svm.gpusvm, r, &ctx);
> >  	if (!xe_vm_is_closed(vm) && mmu_range->event ==
> > MMU_NOTIFY_UNMAP)
> >  		xe_svm_garbage_collector_add_range(vm,
> > to_xe_range(r),
> >  						   mmu_range);
> >  }
> >  
> > +struct xe_svm_invalidate_pass {
> > +	struct drm_gpusvm *gpusvm;
> > +	struct drm_gpusvm_notifier *notifier;
> > +#define XE_SVM_INVALIDATE_FENCE_COUNT	\
> > +	(XE_MAX_TILES_PER_DEVICE * XE_MAX_GT_PER_TILE)
> > +	struct xe_gt_tlb_invalidation_fence
> > fences[XE_SVM_INVALIDATE_FENCE_COUNT];
> > +	struct mmu_interval_notifier_pass p;
> > +};
> > +
> > +static struct mmu_interval_notifier_pass *
> > +xe_svm_invalidate_second(struct mmu_interval_notifier_pass *p,
> > +			 const struct mmu_notifier_range
> > *mmu_range,
> > +			 unsigned long cur_seq)
> > +{
> > +	struct xe_svm_invalidate_pass *pass = container_of(p,
> > typeof(*pass), p);
> > +	struct drm_gpusvm *gpusvm = pass->gpusvm;
> > +	struct drm_gpusvm_notifier *notifier = pass->notifier;
> > +	struct drm_gpusvm_range *r = NULL;
> > +	struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> > +	u64 adj_start = mmu_range->start, adj_end = mmu_range-
> > >end;
> > +	int id;
> > +
> > +	/* Adjust invalidation to notifier boundaries */
> > +	adj_start = max(drm_gpusvm_notifier_start(notifier),
> > adj_start);
> > +	adj_end = min(drm_gpusvm_notifier_end(notifier), adj_end);
> > +
> > +	for (id = 0; id < XE_SVM_INVALIDATE_FENCE_COUNT; ++id)
> > +		xe_gt_tlb_invalidation_fence_wait(&pass-
> > >fences[id]);
> > +
> > +	drm_gpusvm_in_notifier_lock(gpusvm);
> > +	drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
> > +		xe_svm_range_notifier_event_end(vm, r, mmu_range);
> > +	drm_gpusvm_in_notifier_unlock(gpusvm);
> > +
> > +	kfree(pass);
> > +	return NULL;
> > +}
> > +
> >  static void xe_svm_invalidate_twopass(struct drm_gpusvm *gpusvm,
> >  				      struct drm_gpusvm_notifier
> > *notifier,
> >  				      const struct
> > mmu_notifier_range *mmu_range,
> > @@ -179,6 +216,8 @@ static void xe_svm_invalidate_twopass(struct
> > drm_gpusvm *gpusvm,
> >  	struct xe_vm *vm = gpusvm_to_vm(gpusvm);
> >  	struct xe_device *xe = vm->xe;
> >  	struct drm_gpusvm_range *r, *first;
> > +	struct xe_svm_invalidate_pass *pass = NULL;
> > +	struct xe_gt_tlb_invalidation_fence *fences = NULL;
> >  	u64 adj_start = mmu_range->start, adj_end = mmu_range-
> > >end;
> >  	u8 tile_mask = 0;
> >  	long err;
> > @@ -226,14 +265,25 @@ static void xe_svm_invalidate_twopass(struct
> > drm_gpusvm *gpusvm,
> >  
> >  	xe_device_wmb(xe);
> >  
> > -	err = xe_vm_range_tilemask_tlb_invalidation(vm, NULL,
> > adj_start,
> > +	pass = kzalloc(sizeof(*pass), GFP_NOWAIT);
> > +	if (pass) {
> > +		pass->gpusvm = gpusvm;
> > +		pass->notifier = notifier;
> > +		pass->p.pass = xe_svm_invalidate_second;
> > +		fences = pass->fences;
> > +		*p = &pass->p;
> > +	}
> > +
> > +	err = xe_vm_range_tilemask_tlb_invalidation(vm, fences,
> > adj_start,
> >  						    adj_end,
> > tile_mask);
> >  	WARN_ON_ONCE(err);
> >  
> >  range_notifier_event_end:
> > -	r = first;
> > -	drm_gpusvm_for_each_range(r, notifier, adj_start, adj_end)
> > -		xe_svm_range_notifier_event_end(vm, r, mmu_range);
> > +	if (!pass) {
> > +		r = first;
> > +		drm_gpusvm_for_each_range(r, notifier, adj_start,
> > adj_end)
> > +			xe_svm_range_notifier_event_end(vm, r,
> > mmu_range);
> > +	}
> >  }
> >  
> >  static int __xe_svm_garbage_collector(struct xe_vm *vm,
> > -- 
> > 2.50.1
> > 



* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-09 13:51 ` [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes Thomas Hellström
@ 2025-08-18 16:07   ` Jason Gunthorpe
  2025-08-18 16:25     ` Matthew Brost
  2025-08-19 10:03   ` Alistair Popple
  1 sibling, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2025-08-18 16:07 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Andrew Morton, Simona Vetter, Dave Airlie, dri-devel,
	linux-mm, linux-kernel, Matthew Brost, Christian König

On Sat, Aug 09, 2025 at 03:51:32PM +0200, Thomas Hellström wrote:
> GPU use-cases for mmu_interval_notifiers with hmm often involve
> starting a gpu operation and then waiting for it to complete.
> These operations are typically context preemption or TLB flushing.
> 
> With single-pass notifiers per GPU this doesn't scale in
> multi-gpu scenarios. In those scenarios we'd want to first start
> preemption- or TLB flushing on all GPUs and as a second pass wait
> for them to complete on all gpus.

The idea seems reasonable but I'm not sure I like the naming of
'multipass' or necessarily the complexity.

This is sort of a co-operative multithreading thing.

Do you really need a linked list here? At least justify the design
choices in the commit message..

> +struct mmu_interval_notifier_pass {
> +	struct list_head link;
> +	/**
> +	 * @pass: Driver callback for additionall pass.
> +	 * @additional_pass: Pointer to the mmu_interval_notifier_pass structure.
> +	 * @range: The mmu_notifier_range.
> +	 * @cur_seq: The current sequence set by the first pass.
> +	 *
> +	 * Return: Either a pointer to a valid mmu_interval_notifier_pass for
> +	 * another pass to be called, or %NULL if processing is complete for this
> +	 * notifier. There is no error reporting mechanism for additional passes.
> +	 */
> +	struct mmu_interval_notifier_pass *
> +	(*pass) (struct mmu_interval_notifier_pass *additional_pass,
> +		 const struct mmu_notifier_range *range,
> +		 unsigned long cur_seq);
> +};
> +
>  /**
>   * struct mmu_interval_notifier_ops
>   * @invalidate: Upon return the caller must stop using any SPTEs within this
> @@ -243,6 +269,10 @@ struct mmu_interval_notifier_ops {
>  	bool (*invalidate)(struct mmu_interval_notifier *interval_sub,
>  			   const struct mmu_notifier_range *range,
>  			   unsigned long cur_seq);
> +	bool (*invalidate_multipass)(struct mmu_interval_notifier *interval_sub,
> +				     const struct mmu_notifier_range *range,
> +				     unsigned long cur_seq,
> +				     struct mmu_interval_notifier_pass **pass);

Couldn't this just have a pass number counter and some return code to
indicate this notifier is done?

Or do you really need more than 2 passes? Start/finish make sense
too. Otherwise you may have issues overlapping the backgroundable
operations between different driver types?

> +static void mn_itree_additional_passes(struct list_head *additional_passes,
> +				       const struct mmu_notifier_range *range,
> +				       unsigned long cur_seq)
> +{
> +	struct mmu_interval_notifier_pass *p, *next;
> +
> +	while (!list_empty(additional_passes)) {
> +		list_for_each_entry_safe(p, next, additional_passes, link) {
> +			list_del_init(&p->link);
> +			p = p->pass(p, range, cur_seq);
> +			if (p)
> +				list_add_tail(&p->link, additional_passes);
> +		}
> +	}

Like this is very naive, if one driver has only 'prepare' and 'wait
for device ack' passes, then it will immediately stop being concurrent
while another device may be still working on its 3rd pass.

So either this should be more complicated to properly support
different numbers of passes per registration or we should just support
two passes and be done with it?

Jason


* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-18 16:07   ` Jason Gunthorpe
@ 2025-08-18 16:25     ` Matthew Brost
  2025-08-18 16:36       ` Jason Gunthorpe
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Brost @ 2025-08-18 16:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Hellström, intel-xe, Andrew Morton, Simona Vetter,
	Dave Airlie, dri-devel, linux-mm, linux-kernel,
	Christian König

On Mon, Aug 18, 2025 at 01:07:26PM -0300, Jason Gunthorpe wrote:
> On Sat, Aug 09, 2025 at 03:51:32PM +0200, Thomas Hellström wrote:
> > GPU use-cases for mmu_interval_notifiers with hmm often involve
> > starting a gpu operation and then waiting for it to complete.
> > These operations are typically context preemption or TLB flushing.
> > 
> > With single-pass notifiers per GPU this doesn't scale in
> > multi-gpu scenarios. In those scenarios we'd want to first start
> > preemption- or TLB flushing on all GPUs and as a second pass wait
> > for them to complete on all gpus.
> 
> The idea seems reasonable but I'm not sure I like the naming of
> 'multipass' or necessarily the complexity.
> 
> This is sort of a co-operative multithreading thing.
> 
> Do you really need a linked list here? At least justify the design
> choices in the commit message..
> 

I think this choice makes sense: it allows embedding the wait state from
the initial notifier call into the pass structure. Patch [6] shows this
by attaching the issued TLB invalidation fences to the pass. Since a
single notifier may be invoked multiple times with different ranges but
the same seqno, I think this is the correct design choice—otherwise
there’s no unambiguous way to track per-invocation wait state. I agree
this should be documented in both the commit message and kernel-doc.

Matt

[6] https://patchwork.freedesktop.org/patch/667844/?series=152725&rev=1

> > +struct mmu_interval_notifier_pass {
> > +	struct list_head link;
> > +	/**
> > +	 * @pass: Driver callback for additionall pass.
> > +	 * @additional_pass: Pointer to the mmu_interval_notifier_pass structure.
> > +	 * @range: The mmu_notifier_range.
> > +	 * @cur_seq: The current sequence set by the first pass.
> > +	 *
> > +	 * Return: Either a pointer to a valid mmu_interval_notifier_pass for
> > +	 * another pass to be called, or %NULL if processing is complete for this
> > +	 * notifier. There is no error reporting mechanism for additional passes.
> > +	 */
> > +	struct mmu_interval_notifier_pass *
> > +	(*pass) (struct mmu_interval_notifier_pass *additional_pass,
> > +		 const struct mmu_notifier_range *range,
> > +		 unsigned long cur_seq);
> > +};
> > +
> >  /**
> >   * struct mmu_interval_notifier_ops
> >   * @invalidate: Upon return the caller must stop using any SPTEs within this
> > @@ -243,6 +269,10 @@ struct mmu_interval_notifier_ops {
> >  	bool (*invalidate)(struct mmu_interval_notifier *interval_sub,
> >  			   const struct mmu_notifier_range *range,
> >  			   unsigned long cur_seq);
> > +	bool (*invalidate_multipass)(struct mmu_interval_notifier *interval_sub,
> > +				     const struct mmu_notifier_range *range,
> > +				     unsigned long cur_seq,
> > +				     struct mmu_interval_notifier_pass **pass);
> 
> Couldn't this just have a pass number counter and some return code to
> indicate this notifier is done?
> 
> Or do you really need more than 2 passes? Start/finish make sense
> too. Otherwise you may have issues overlapping the backgroundable
> operations between different driver types?
> 
> > +static void mn_itree_additional_passes(struct list_head *additional_passes,
> > +				       const struct mmu_notifier_range *range,
> > +				       unsigned long cur_seq)
> > +{
> > +	struct mmu_interval_notifier_pass *p, *next;
> > +
> > +	while (!list_empty(additional_passes)) {
> > +		list_for_each_entry_safe(p, next, additional_passes, link) {
> > +			list_del_init(&p->link);
> > +			p = p->pass(p, range, cur_seq);
> > +			if (p)
> > +				list_add_tail(&p->link, additional_passes);
> > +		}
> > +	}
> 
> Like this is very naive, if one driver has only 'prepare' and 'wait
> for device ack' passes, then it will immediately stop being concurrent
> while another device may be still working on its 3rd pass.
> 
> So either this should be more complicated to properly support
> different numbers of passes per registration or we should just support
> two passes and be done with it?
> 
> Jason


* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-18 16:25     ` Matthew Brost
@ 2025-08-18 16:36       ` Jason Gunthorpe
  2025-08-18 16:42         ` Thomas Hellström
  2025-08-18 16:44         ` Matthew Brost
  0 siblings, 2 replies; 22+ messages in thread
From: Jason Gunthorpe @ 2025-08-18 16:36 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Thomas Hellström, intel-xe, Andrew Morton, Simona Vetter,
	Dave Airlie, dri-devel, linux-mm, linux-kernel,
	Christian König

On Mon, Aug 18, 2025 at 09:25:20AM -0700, Matthew Brost wrote:
> I think this choice makes sense: it allows embedding the wait state from
> the initial notifier call into the pass structure. Patch [6] shows this
> by attaching the issued TLB invalidation fences to the pass. Since a
> single notifier may be invoked multiple times with different ranges but
> the same seqno,

That should be explained, but also seems to be a bit of a different
issue..

If the design is really to only have two passes and this linked list
is about retaining state then there should not be so much freedom to
have more passes.

Jason


* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-18 16:36       ` Jason Gunthorpe
@ 2025-08-18 16:42         ` Thomas Hellström
  2025-08-18 16:45           ` Matthew Brost
  2025-08-18 16:44         ` Matthew Brost
  1 sibling, 1 reply; 22+ messages in thread
From: Thomas Hellström @ 2025-08-18 16:42 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Brost
  Cc: intel-xe, Andrew Morton, Simona Vetter, Dave Airlie, dri-devel,
	linux-mm, linux-kernel, Christian König

On Mon, 2025-08-18 at 13:36 -0300, Jason Gunthorpe wrote:
> On Mon, Aug 18, 2025 at 09:25:20AM -0700, Matthew Brost wrote:
> > I think this choice makes sense: it allows embedding the wait state
> > from
> > the initial notifier call into the pass structure. Patch [6] shows
> > this
> > by attaching the issued TLB invalidation fences to the pass. Since
> > a
> > single notifier may be invoked multiple times with different ranges
> > but
> > the same seqno,
> 
> That should be explained, but also seems to be a bit of a different
> issue..
> 
> If the design is really to only have two passes and this linked list
> is about retaining state then there should not be so much freedom to
> have more passes.

Actually the initial suggestion was two passes only. Then I thought I
saw a use-case for even three passes and added the multi-pass thing,
but I think it turned out we didn't have such a use-case. IMO we could
restrict it to two-pass. Matthew, that should be completely OK for the
SVM use-case, right?

/Thomas


> 
> Jason



* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-18 16:36       ` Jason Gunthorpe
  2025-08-18 16:42         ` Thomas Hellström
@ 2025-08-18 16:44         ` Matthew Brost
  2025-08-18 16:46           ` Jason Gunthorpe
  1 sibling, 1 reply; 22+ messages in thread
From: Matthew Brost @ 2025-08-18 16:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Hellström, intel-xe, Andrew Morton, Simona Vetter,
	Dave Airlie, dri-devel, linux-mm, linux-kernel,
	Christian König

On Mon, Aug 18, 2025 at 01:36:17PM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 18, 2025 at 09:25:20AM -0700, Matthew Brost wrote:
> > I think this choice makes sense: it allows embedding the wait state from
> > the initial notifier call into the pass structure. Patch [6] shows this
> > by attaching the issued TLB invalidation fences to the pass. Since a
> > single notifier may be invoked multiple times with different ranges but
> > the same seqno,
> 
> That should be explained, but also seems to be a bit of a different
> issue..
> 
> If the design is really to only have two passes and this linked list
> is about retaining state then there should not be so much freedom to
> have more passes.

I’ll let Thomas weigh in on whether we really need more than two passes;
my feeling is that two passes are likely sufficient. It’s also worth
noting that the linked list has an added benefit: the notifier tree only
needs to be walked once (a small time-complexity win).

Matt

> 
> Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-18 16:42         ` Thomas Hellström
@ 2025-08-18 16:45           ` Matthew Brost
  0 siblings, 0 replies; 22+ messages in thread
From: Matthew Brost @ 2025-08-18 16:45 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Jason Gunthorpe, intel-xe, Andrew Morton, Simona Vetter,
	Dave Airlie, dri-devel, linux-mm, linux-kernel,
	Christian König

On Mon, Aug 18, 2025 at 06:42:36PM +0200, Thomas Hellström wrote:
> On Mon, 2025-08-18 at 13:36 -0300, Jason Gunthorpe wrote:
> > On Mon, Aug 18, 2025 at 09:25:20AM -0700, Matthew Brost wrote:
> > > I think this choice makes sense: it allows embedding the wait state
> > > from
> > > the initial notifier call into the pass structure. Patch [6] shows
> > > this
> > > by attaching the issued TLB invalidation fences to the pass. Since
> > > a
> > > single notifier may be invoked multiple times with different ranges
> > > but
> > > the same seqno,
> > 
> > That should be explained, but also seems to be a bit of a different
> > issue..
> > 
> > If the design is really to only have two passes and this linked list
> > is about retaining state then there should not be so much freedom to
> > have more passes.
> 
> Actually the initial suggestion was two passes only. Then I thought I
> saw a use-case for even three passes and added the multi-pass thing,
> but I think it turned out we didn't have such a use-case. IMO we could
> restrict it to two-pass. Matthew, that should be completely OK for the
> SVM use-case, right?
> 

Yea, I just replied that 2 passes should be sufficient.

Matt

> /Thomas
> 
> 
> > 
> > Jason
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-18 16:44         ` Matthew Brost
@ 2025-08-18 16:46           ` Jason Gunthorpe
  2025-08-19  9:55             ` Alistair Popple
  0 siblings, 1 reply; 22+ messages in thread
From: Jason Gunthorpe @ 2025-08-18 16:46 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Thomas Hellström, intel-xe, Andrew Morton, Simona Vetter,
	Dave Airlie, dri-devel, linux-mm, linux-kernel,
	Christian König

On Mon, Aug 18, 2025 at 09:44:01AM -0700, Matthew Brost wrote:
> On Mon, Aug 18, 2025 at 01:36:17PM -0300, Jason Gunthorpe wrote:
> > On Mon, Aug 18, 2025 at 09:25:20AM -0700, Matthew Brost wrote:
> > > I think this choice makes sense: it allows embedding the wait state from
> > > the initial notifier call into the pass structure. Patch [6] shows this
> > > by attaching the issued TLB invalidation fences to the pass. Since a
> > > single notifier may be invoked multiple times with different ranges but
> > > the same seqno,
> > 
> > That should be explained, but also seems to be a bit of a different
> > issue..
> > 
> > If the design is really to only have two passes and this linked list
> > is about retaining state then there should not be so much freedom to
> > have more passes.
> 
> I’ll let Thomas weigh in on whether we really need more than two passes;
> my feeling is that two passes are likely sufficient. It’s also worth
> noting that the linked list has an added benefit: the notifier tree only
> needs to be walked once (a small time-complexity win).

You may end up keeping the linked list just with no way to add a third
pass.

Jason

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-18 16:46           ` Jason Gunthorpe
@ 2025-08-19  9:55             ` Alistair Popple
  2025-08-19 11:33               ` Thomas Hellström
  0 siblings, 1 reply; 22+ messages in thread
From: Alistair Popple @ 2025-08-19  9:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matthew Brost, Thomas Hellström, intel-xe, Andrew Morton,
	Simona Vetter, Dave Airlie, dri-devel, linux-mm, linux-kernel,
	Christian König

On Mon, Aug 18, 2025 at 01:46:55PM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 18, 2025 at 09:44:01AM -0700, Matthew Brost wrote:
> > On Mon, Aug 18, 2025 at 01:36:17PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Aug 18, 2025 at 09:25:20AM -0700, Matthew Brost wrote:
> > > > I think this choice makes sense: it allows embedding the wait state from
> > > > the initial notifier call into the pass structure. Patch [6] shows this
> > > > by attaching the issued TLB invalidation fences to the pass. Since a
> > > > single notifier may be invoked multiple times with different ranges but
> > > > the same seqno,
> > > 
> > > That should be explained, but also seems to be a bit of a different
> > > issue..
> > > 
> > > If the design is really to only have two passes and this linked list
> > > is about retaining state then there should not be so much freedom to
> > > have more passes.
> > 
> > I’ll let Thomas weigh in on whether we really need more than two passes;
> > my feeling is that two passes are likely sufficient. It’s also worth
> > noting that the linked list has an added benefit: the notifier tree only
> > needs to be walked once (a small time-complexity win).
> 
> You may end up keeping the linked list just with no way to add a third
> pass.

It seems to me though that the linked list still adds unnecessary complexity. I
think this would all be much easier to follow if we just added two new callbacks
- invalidate_start() and invalidate_end(), say.

Admittedly that would still require the linked list (or something similar) to
retain the ability to hold/pass a context between the start and end callbacks.
Which is a bit annoying; it's a pity we need to allocate memory in a
performance-sensitive path to effectively pass (at least in this case) a single
pointer. I can't think of any obvious solutions to that though.
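
To illustrate the two-callback shape I have in mind, a rough sketch (the names
and signatures here are hypothetical, not something from the posted series):

struct mmu_interval_notifier_ops {
	bool (*invalidate)(struct mmu_interval_notifier *interval_sub,
			   const struct mmu_notifier_range *range,
			   unsigned long cur_seq);
	/* First pass: kick off the device-side work, stash state in *ctx. */
	bool (*invalidate_start)(struct mmu_interval_notifier *interval_sub,
				 const struct mmu_notifier_range *range,
				 unsigned long cur_seq, void **ctx);
	/* Second pass: wait for completion using the state from the start pass. */
	void (*invalidate_end)(struct mmu_interval_notifier *interval_sub,
			       const struct mmu_notifier_range *range,
			       unsigned long cur_seq, void *ctx);
};

The core would still have to remember each subscription's ctx pointer between
the two tree walks, which is where the allocation question above comes back in.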

> Jason
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-09 13:51 ` [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes Thomas Hellström
  2025-08-18 16:07   ` Jason Gunthorpe
@ 2025-08-19 10:03   ` Alistair Popple
  2025-08-19 11:35     ` Thomas Hellström
  1 sibling, 1 reply; 22+ messages in thread
From: Alistair Popple @ 2025-08-19 10:03 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: intel-xe, Jason Gunthorpe, Andrew Morton, Simona Vetter,
	Dave Airlie, dri-devel, linux-mm, linux-kernel, Matthew Brost,
	Christian König

On Sat, Aug 09, 2025 at 03:51:32PM +0200, Thomas Hellström wrote:
> GPU use-cases for mmu_interval_notifiers with hmm often involve
> starting a gpu operation and then waiting for it to complete.
> These operations are typically context preemption or TLB flushing.
> 
> With single-pass notifiers per GPU this doesn't scale in
> multi-gpu scenarios. In those scenarios we'd want to first start
> preemption- or TLB flushing on all GPUs and as a second pass wait
> for them to complete on all gpus.
> 
> One can do this on per-driver basis multiplexing per-driver
> notifiers but that would mean sharing the notifier "user" lock
> across all GPUs and that doesn't scale well either, so adding support
> for multi-pass in the core appears like the right choice.
> 
> Implement multi-pass capability in the mmu_interval_notifier. Use a
> linked list for the additional passes to minimize the impact for
> use-cases that don't need the multi-pass functionality.
> 
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Simona Vetter <simona.vetter@ffwll.ch>
> Cc: Dave Airlie <airlied@gmail.com>
> Cc: <dri-devel@lists.freedesktop.org>
> Cc: <linux-mm@kvack.org>
> Cc: <linux-kernel@vger.kernel.org>
> 
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>  include/linux/mmu_notifier.h | 30 ++++++++++++++++
>  mm/mmu_notifier.c            | 67 +++++++++++++++++++++++++++++++-----
>  2 files changed, 88 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index d1094c2d5fb6..1107a8eafd8a 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -233,6 +233,32 @@ struct mmu_notifier {
>  	unsigned int users;
>  };
>  
> +/**
> + * struct mmu_interval_notifier_pass - mmu_interval_notifier multi-pass abstraction
> + * @link: List link for the notifiers pending pass list
> + *
> + * Allocate, typically using GFP_NOWAIT in the interval notifier's first pass.
> + * If allocation fails (which is not unlikely under memory pressure), fall back
> + * to single-pass operation.
> + */
> +struct mmu_interval_notifier_pass {

If we limit the number of passes to two maybe call this
`mmu_interval_notifier_finish()`? ...

> +	struct list_head link;
> +	/**
> +	 * @pass: Driver callback for additionall pass.
> +	 * @additional_pass: Pointer to the mmu_interval_notifier_pass structure.
> +	 * @range: The mmu_notifier_range.
> +	 * @cur_seq: The current sequence set by the first pass.
> +	 *
> +	 * Return: Either a pointer to a valid mmu_interval_notifier_pass for
> +	 * another pass to be called, or %NULL if processing is complete for this
> +	 * notifier. There is no error reporting mechanism for additional passes.
> +	 */
> +	struct mmu_interval_notifier_pass *
> +	(*pass) (struct mmu_interval_notifier_pass *additional_pass,

... and call this `finish()` ...

> +		 const struct mmu_notifier_range *range,
> +		 unsigned long cur_seq);
> +};
> +
>  /**
>   * struct mmu_interval_notifier_ops
>   * @invalidate: Upon return the caller must stop using any SPTEs within this
> @@ -243,6 +269,10 @@ struct mmu_interval_notifier_ops {
>  	bool (*invalidate)(struct mmu_interval_notifier *interval_sub,
>  			   const struct mmu_notifier_range *range,
>  			   unsigned long cur_seq);
> +	bool (*invalidate_multipass)(struct mmu_interval_notifier *interval_sub,

... and then this could be called `invalidate_start()`. That might address some
of the concerns with naming.

> +				     const struct mmu_notifier_range *range,
> +				     unsigned long cur_seq,
> +				     struct mmu_interval_notifier_pass **pass);
>  };
>  
>  struct mmu_interval_notifier {
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 8e0125dc0522..dd6af87db103 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -260,6 +260,22 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
>  }
>  EXPORT_SYMBOL_GPL(mmu_interval_read_begin);
>  
> +static void mn_itree_additional_passes(struct list_head *additional_passes,
> +				       const struct mmu_notifier_range *range,
> +				       unsigned long cur_seq)
> +{
> +	struct mmu_interval_notifier_pass *p, *next;
> +
> +	while (!list_empty(additional_passes)) {
> +		list_for_each_entry_safe(p, next, additional_passes, link) {
> +			list_del_init(&p->link);
> +			p = p->pass(p, range, cur_seq);
> +			if (p)
> +				list_add_tail(&p->link, additional_passes);
> +		}
> +	}
> +}
> +
>  static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
>  			     struct mm_struct *mm)
>  {
> @@ -272,17 +288,32 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
>  	};
>  	struct mmu_interval_notifier *interval_sub;
>  	unsigned long cur_seq;
> +	LIST_HEAD(additional_passes);
>  	bool ret;
>  
>  	for (interval_sub =
>  		     mn_itree_inv_start_range(subscriptions, &range, &cur_seq);
>  	     interval_sub;
>  	     interval_sub = mn_itree_inv_next(interval_sub, &range)) {
> -		ret = interval_sub->ops->invalidate(interval_sub, &range,
> -						    cur_seq);
> +		if (interval_sub->ops->invalidate_multipass) {
> +			struct mmu_interval_notifier_pass *second = NULL;
> +
> +			ret = interval_sub->ops->invalidate_multipass(interval_sub,
> +								      &range,
> +								      cur_seq,
> +								      &second);
> +			if (ret && second)
> +				list_add_tail(&second->link, &additional_passes);
> +
> +		} else {
> +			ret = interval_sub->ops->invalidate(interval_sub,
> +							    &range,
> +							    cur_seq);
> +		}
>  		WARN_ON(!ret);
>  	}
>  
> +	mn_itree_additional_passes(&additional_passes, &range, cur_seq);
>  	mn_itree_inv_end(subscriptions);
>  }
>  
> @@ -431,6 +462,8 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
>  {
>  	struct mmu_interval_notifier *interval_sub;
>  	unsigned long cur_seq;
> +	LIST_HEAD(additional_passes);
> +	int err = 0;
>  
>  	for (interval_sub =
>  		     mn_itree_inv_start_range(subscriptions, range, &cur_seq);
> @@ -438,23 +471,39 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
>  	     interval_sub = mn_itree_inv_next(interval_sub, range)) {
>  		bool ret;
>  
> -		ret = interval_sub->ops->invalidate(interval_sub, range,
> -						    cur_seq);
> +		if (interval_sub->ops->invalidate_multipass) {
> +			struct mmu_interval_notifier_pass *second = NULL;
> +
> +			ret = interval_sub->ops->invalidate_multipass(interval_sub,
> +								      range,
> +								      cur_seq,
> +								      &second);
> +			if (ret && second)
> +				list_add_tail(&second->link, &additional_passes);
> +
> +		} else {
> +			ret = interval_sub->ops->invalidate(interval_sub,
> +							    range,
> +							    cur_seq);
> +		}
>  		if (!ret) {
>  			if (WARN_ON(mmu_notifier_range_blockable(range)))
>  				continue;
> -			goto out_would_block;
> +			err = -EAGAIN;
> +			break;
>  		}
>  	}
> -	return 0;
>  
> -out_would_block:
> +	mn_itree_additional_passes(&additional_passes, range, cur_seq);
> +
>  	/*
>  	 * On -EAGAIN the non-blocking caller is not allowed to call
>  	 * invalidate_range_end()
>  	 */
> -	mn_itree_inv_end(subscriptions);
> -	return -EAGAIN;
> +	if (err)
> +		mn_itree_inv_end(subscriptions);
> +
> +	return err;
>  }
>  
>  static int mn_hlist_invalidate_range_start(
> -- 
> 2.50.1
> 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-19  9:55             ` Alistair Popple
@ 2025-08-19 11:33               ` Thomas Hellström
  2025-08-19 15:35                 ` Matthew Brost
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Hellström @ 2025-08-19 11:33 UTC (permalink / raw)
  To: Alistair Popple, Jason Gunthorpe
  Cc: Matthew Brost, intel-xe, Andrew Morton, Simona Vetter,
	Dave Airlie, dri-devel, linux-mm, linux-kernel,
	Christian König

On Tue, 2025-08-19 at 19:55 +1000, Alistair Popple wrote:
> On Mon, Aug 18, 2025 at 01:46:55PM -0300, Jason Gunthorpe wrote:
> > On Mon, Aug 18, 2025 at 09:44:01AM -0700, Matthew Brost wrote:
> > > On Mon, Aug 18, 2025 at 01:36:17PM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 18, 2025 at 09:25:20AM -0700, Matthew Brost wrote:
> > > > > I think this choice makes sense: it allows embedding the wait
> > > > > state from
> > > > > the initial notifier call into the pass structure. Patch [6]
> > > > > shows this
> > > > > by attaching the issued TLB invalidation fences to the pass.
> > > > > Since a
> > > > > single notifier may be invoked multiple times with different
> > > > > ranges but
> > > > > the same seqno,
> > > > 
> > > > That should be explained, but also seems to be a bit of a
> > > > different
> > > > issue..
> > > > 
> > > > If the design is really to only have two passes and this linked
> > > > list
> > > > is about retaining state then there should not be so much
> > > > freedom to
> > > > have more passes.
> > > 
> > > I’ll let Thomas weigh in on whether we really need more than two
> > > passes;
> > > my feeling is that two passes are likely sufficient. It’s also
> > > worth
> > > noting that the linked list has an added benefit: the notifier
> > > tree only
> > > needs to be walked once (a small time-complexity win).
> > 
> > You may end up keeping the linked list just with no way to add a
> > third
> > pass.
> 
> It seems to me though that linked list still adds unnecessary
> complexity. I
> think this would all be much easier to follow if we just added two
> new callbacks
> - invalidate_start() and invalidate_end() say.

One thing that the linked list avoids, though, is traversing the
interval tree a second time. That extra traversal is O(n*log(n)), whereas
the linked-list overhead is just O(n_2pass).

> 
> Admitedly that would still require the linked list (or something
> similar) to
> retain the ability to hold/pass a context between the start and end
> callbacks.
> Which is bit annoying, it's a pity we need to allocate memory in a
> performance
> sensitive path to effectively pass (at least in this case) a single
> pointer. I
> can't think of any obvious solutions to that though.

One idea is for any two-pass notifier implementation to use a small
pool. That would also to some extent mitigate the risk of out-of-memory
with GFP_NOWAIT.
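
As a minimal sketch of that idea (pool size and helper names are made up, and
the pass struct would in practice be embedded in a larger driver structure):

#include <linux/mempool.h>

/* e.g. mempool_create_kmalloc_pool(16, sizeof(struct mmu_interval_notifier_pass)) */
static mempool_t *pass_pool;

static struct mmu_interval_notifier_pass *xe_pass_get(void)
{
	/* Non-blocking; a NULL return means fall back to single-pass operation. */
	return mempool_alloc(pass_pool, GFP_NOWAIT);
}

static void xe_pass_put(struct mmu_interval_notifier_pass *pass)
{
	mempool_free(pass, pass_pool);
}

The pool would guarantee that a minimum number of second passes can always be
queued, while still degrading gracefully to single-pass operation when it runs
dry.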

/Thomas


> 
> > Jason
> > 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-19 10:03   ` Alistair Popple
@ 2025-08-19 11:35     ` Thomas Hellström
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Hellström @ 2025-08-19 11:35 UTC (permalink / raw)
  To: Alistair Popple
  Cc: intel-xe, Jason Gunthorpe, Andrew Morton, Simona Vetter,
	Dave Airlie, dri-devel, linux-mm, linux-kernel, Matthew Brost,
	Christian König

On Tue, 2025-08-19 at 20:03 +1000, Alistair Popple wrote:
> On Sat, Aug 09, 2025 at 03:51:32PM +0200, Thomas Hellström wrote:
> > GPU use-cases for mmu_interval_notifiers with hmm often involve
> > starting a gpu operation and then waiting for it to complete.
> > These operations are typically context preemption or TLB flushing.
> > 
> > With single-pass notifiers per GPU this doesn't scale in
> > multi-gpu scenarios. In those scenarios we'd want to first start
> > preemption- or TLB flushing on all GPUs and as a second pass wait
> > for them to complete on all gpus.
> > 
> > One can do this on per-driver basis multiplexing per-driver
> > notifiers but that would mean sharing the notifier "user" lock
> > across all GPUs and that doesn't scale well either, so adding
> > support
> > for multi-pass in the core appears like the right choice.
> > 
> > Implement multi-pass capability in the mmu_interval_notifier. Use a
> > linked list for the additional passes to minimize the impact for
> > use-cases that don't need the multi-pass functionality.
> > 
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Simona Vetter <simona.vetter@ffwll.ch>
> > Cc: Dave Airlie <airlied@gmail.com>
> > Cc: <dri-devel@lists.freedesktop.org>
> > Cc: <linux-mm@kvack.org>
> > Cc: <linux-kernel@vger.kernel.org>
> > 
> > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > ---
> >  include/linux/mmu_notifier.h | 30 ++++++++++++++++
> >  mm/mmu_notifier.c            | 67 +++++++++++++++++++++++++++++++-
> > ----
> >  2 files changed, 88 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/mmu_notifier.h
> > b/include/linux/mmu_notifier.h
> > index d1094c2d5fb6..1107a8eafd8a 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -233,6 +233,32 @@ struct mmu_notifier {
> >  	unsigned int users;
> >  };
> >  
> > +/**
> > + * struct mmu_interval_notifier_pass - mmu_interval_notifier
> > multi-pass abstraction
> > + * @link: List link for the notifiers pending pass list
> > + *
> > + * Allocate, typically using GFP_NOWAIT in the interval notifier's
> > first pass.
> > + * If allocation fails (which is not unlikely under memory
> > pressure), fall back
> > + * to single-pass operation.
> > + */
> > +struct mmu_interval_notifier_pass {
> 
> If we limit the number of passes to two maybe call this
> `mmu_interval_notifier_finish()`? ...
> 
> > +	struct list_head link;
> > +	/**
> > +	 * @pass: Driver callback for additionall pass.
> > +	 * @additional_pass: Pointer to the
> > mmu_interval_notifier_pass structure.
> > +	 * @range: The mmu_notifier_range.
> > +	 * @cur_seq: The current sequence set by the first pass.
> > +	 *
> > +	 * Return: Either a pointer to a valid
> > mmu_interval_notifier_pass for
> > +	 * another pass to be called, or %NULL if processing is
> > complete for this
> > +	 * notifier. There is no error reporting mechanism for
> > additional passes.
> > +	 */
> > +	struct mmu_interval_notifier_pass *
> > +	(*pass) (struct mmu_interval_notifier_pass
> > *additional_pass,
> 

> 
> > +		 const struct mmu_notifier_range *range,
> > +		 unsigned long cur_seq);
> > +};
> > +
> >  /**
> >   * struct mmu_interval_notifier_ops
> >   * @invalidate: Upon return the caller must stop using any SPTEs
> > within this
> > @@ -243,6 +269,10 @@ struct mmu_interval_notifier_ops {
> >  	bool (*invalidate)(struct mmu_interval_notifier
> > *interval_sub,
> >  			   const struct mmu_notifier_range *range,
> >  			   unsigned long cur_seq);
> > +	bool (*invalidate_multipass)(struct mmu_interval_notifier
> > *interval_sub,
> 
> ... and then this could be called `invalidate_start()`. That might
> address some
> of the concerns with naming.

Makes sense. I'll have a look at that.

/Thomas


> 
> > +				     const struct
> > mmu_notifier_range *range,
> > +				     unsigned long cur_seq,
> > +				     struct
> > mmu_interval_notifier_pass **pass);
> >  };
> >  
> >  struct mmu_interval_notifier {
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 8e0125dc0522..dd6af87db103 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -260,6 +260,22 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
> >  }
> >  EXPORT_SYMBOL_GPL(mmu_interval_read_begin);
> >  
> > +static void mn_itree_additional_passes(struct list_head *additional_passes,
> > +				       const struct mmu_notifier_range *range,
> > +				       unsigned long cur_seq)
> > +{
> > +	struct mmu_interval_notifier_pass *p, *next;
> > +
> > +	while (!list_empty(additional_passes)) {
> > +		list_for_each_entry_safe(p, next, additional_passes, link) {
> > +			list_del_init(&p->link);
> > +			p = p->pass(p, range, cur_seq);
> > +			if (p)
> > +				list_add_tail(&p->link, additional_passes);
> > +		}
> > +	}
> > +}
> > +
> >  static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
> >  			     struct mm_struct *mm)
> >  {
> > @@ -272,17 +288,32 @@ static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
> >  	};
> >  	struct mmu_interval_notifier *interval_sub;
> >  	unsigned long cur_seq;
> > +	LIST_HEAD(additional_passes);
> >  	bool ret;
> >  
> >  	for (interval_sub =
> >  		     mn_itree_inv_start_range(subscriptions, &range, &cur_seq);
> >  	     interval_sub;
> >  	     interval_sub = mn_itree_inv_next(interval_sub, &range)) {
> > -		ret = interval_sub->ops->invalidate(interval_sub, &range,
> > -						    cur_seq);
> > +		if (interval_sub->ops->invalidate_multipass) {
> > +			struct mmu_interval_notifier_pass *second = NULL;
> > +
> > +			ret = interval_sub->ops->invalidate_multipass(interval_sub,
> > +								      &range,
> > +								      cur_seq,
> > +								      &second);
> > +			if (ret && second)
> > +				list_add_tail(&second->link, &additional_passes);
> > +
> > +		} else {
> > +			ret = interval_sub->ops->invalidate(interval_sub,
> > +							    &range,
> > +							    cur_seq);
> > +		}
> >  		WARN_ON(!ret);
> >  	}
> >  
> > +	mn_itree_additional_passes(&additional_passes, &range, cur_seq);
> >  	mn_itree_inv_end(subscriptions);
> >  }
> >  
> > @@ -431,6 +462,8 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
> >  {
> >  	struct mmu_interval_notifier *interval_sub;
> >  	unsigned long cur_seq;
> > +	LIST_HEAD(additional_passes);
> > +	int err = 0;
> >  
> >  	for (interval_sub =
> >  		     mn_itree_inv_start_range(subscriptions, range, &cur_seq);
> > @@ -438,23 +471,39 @@ static int mn_itree_invalidate(struct mmu_notifier_subscriptions *subscriptions,
> >  	     interval_sub = mn_itree_inv_next(interval_sub, range)) {
> >  		bool ret;
> >  
> > -		ret = interval_sub->ops->invalidate(interval_sub, range,
> > -						    cur_seq);
> > +		if (interval_sub->ops->invalidate_multipass) {
> > +			struct mmu_interval_notifier_pass *second = NULL;
> > +
> > +			ret = interval_sub->ops->invalidate_multipass(interval_sub,
> > +								      range,
> > +								      cur_seq,
> > +								      &second);
> > +			if (ret && second)
> > +				list_add_tail(&second->link, &additional_passes);
> > +
> > +		} else {
> > +			ret = interval_sub->ops->invalidate(interval_sub,
> > +							    range,
> > +							    cur_seq);
> > +		}
> >  		if (!ret) {
> >  			if (WARN_ON(mmu_notifier_range_blockable(range)))
> >  				continue;
> > -			goto out_would_block;
> > +			err = -EAGAIN;
> > +			break;
> >  		}
> >  	}
> > -	return 0;
> >  
> > -out_would_block:
> > +	mn_itree_additional_passes(&additional_passes, range, cur_seq);
> > +
> >  	/*
> >  	 * On -EAGAIN the non-blocking caller is not allowed to call
> >  	 * invalidate_range_end()
> >  	 */
> > -	mn_itree_inv_end(subscriptions);
> > -	return -EAGAIN;
> > +	if (err)
> > +		mn_itree_inv_end(subscriptions);
> > +
> > +	return err;
> >  }
> >  
> >  static int mn_hlist_invalidate_range_start(
> > -- 
> > 2.50.1
> > 
> > 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-19 11:33               ` Thomas Hellström
@ 2025-08-19 15:35                 ` Matthew Brost
  2025-08-21  9:34                   ` Thomas Hellström
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Brost @ 2025-08-19 15:35 UTC (permalink / raw)
  To: Thomas Hellström
  Cc: Alistair Popple, Jason Gunthorpe, intel-xe, Andrew Morton,
	Simona Vetter, Dave Airlie, dri-devel, linux-mm, linux-kernel,
	Christian König

On Tue, Aug 19, 2025 at 01:33:40PM +0200, Thomas Hellström wrote:
> On Tue, 2025-08-19 at 19:55 +1000, Alistair Popple wrote:
> > On Mon, Aug 18, 2025 at 01:46:55PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Aug 18, 2025 at 09:44:01AM -0700, Matthew Brost wrote:
> > > > On Mon, Aug 18, 2025 at 01:36:17PM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Aug 18, 2025 at 09:25:20AM -0700, Matthew Brost wrote:
> > > > > > I think this choice makes sense: it allows embedding the wait
> > > > > > state from
> > > > > > the initial notifier call into the pass structure. Patch [6]
> > > > > > shows this
> > > > > > by attaching the issued TLB invalidation fences to the pass.
> > > > > > Since a
> > > > > > single notifier may be invoked multiple times with different
> > > > > > ranges but
> > > > > > the same seqno,
> > > > > 
> > > > > That should be explained, but also seems to be a bit of a
> > > > > different
> > > > > issue..
> > > > > 
> > > > > If the design is really to only have two passes and this linked
> > > > > list
> > > > > is about retaining state then there should not be so much
> > > > > freedom to
> > > > > have more passes.
> > > > 
> > > > I’ll let Thomas weigh in on whether we really need more than two
> > > > passes;
> > > > my feeling is that two passes are likely sufficient. It’s also
> > > > worth
> > > > noting that the linked list has an added benefit: the notifier
> > > > tree only
> > > > needs to be walked once (a small time-complexity win).
> > > 
> > > You may end up keeping the linked list just with no way to add a
> > > third
> > > pass.
> > 
> > It seems to me though that linked list still adds unnecessary
> > complexity. I
> > think this would all be much easier to follow if we just added two
> > new callbacks
> > - invalidate_start() and invalidate_end() say.
> 
> One thing that the linked list avoids, though, is traversing the
> interval tree two times. It has O(n*log(n)) whereas the linked list
> overhead is just O(n_2pass).
> 
> > 
> > Admitedly that would still require the linked list (or something
> > similar) to
> > retain the ability to hold/pass a context between the start and end
> > callbacks.
> > Which is bit annoying, it's a pity we need to allocate memory in a
> > performance
> > sensitive path to effectively pass (at least in this case) a single
> > pointer. I
> > can't think of any obvious solutions to that though.
> 
> One idea is for any two-pass notifier implementation to use a small
> pool. That would also to some extent mitigate the risk of out-of-memory
> with GFP_NOWAIT.
> 

I think we can attach a preallocated list entry to the driver-side
notifier state; then you’d only need to allocate (or block) if that
notifier is invoked more than once while a wait action (e.g., a TLB
invalidation) is outstanding. Multiple invocations are technically
possible, but in practice I’d expect them to be rare.

I’m not sure how much of a win this is, though. On Intel hardware, TLB
invalidations are several orders of magnitude slower than the software
steps our notifiers perform. Ultimately, whether to allocate or
preallocate is a driver-side choice.
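
For reference, the preallocated variant I'm thinking of looks roughly like
this (struct and field names are invented, and it assumes the driver
serializes these paths under its own notifier lock):

struct xe_notifier_state {
	struct mmu_interval_notifier notifier;
	struct mmu_interval_notifier_pass prealloc;
	bool prealloc_busy;
};

static struct mmu_interval_notifier_pass *
xe_notifier_pass_get(struct xe_notifier_state *state)
{
	/* Common case: reuse the embedded entry, no allocation at all. */
	if (!state->prealloc_busy) {
		state->prealloc_busy = true;
		return &state->prealloc;
	}

	/* Rare case: a second invalidation overlaps an outstanding wait. */
	return kzalloc(sizeof(struct mmu_interval_notifier_pass), GFP_NOWAIT);
}

The second-pass callback would then clear prealloc_busy (or kfree() the
entry) once the wait completes.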

Matt

> /Thomas
> 
> 
> > 
> > > Jason
> > > 
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes
  2025-08-19 15:35                 ` Matthew Brost
@ 2025-08-21  9:34                   ` Thomas Hellström
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Hellström @ 2025-08-21  9:34 UTC (permalink / raw)
  To: Matthew Brost
  Cc: Alistair Popple, Jason Gunthorpe, intel-xe, Andrew Morton,
	Simona Vetter, Dave Airlie, dri-devel, linux-mm, linux-kernel,
	Christian König

On Tue, 2025-08-19 at 08:35 -0700, Matthew Brost wrote:
> On Tue, Aug 19, 2025 at 01:33:40PM +0200, Thomas Hellström wrote:
> > On Tue, 2025-08-19 at 19:55 +1000, Alistair Popple wrote:
> > > On Mon, Aug 18, 2025 at 01:46:55PM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 18, 2025 at 09:44:01AM -0700, Matthew Brost wrote:
> > > > > On Mon, Aug 18, 2025 at 01:36:17PM -0300, Jason Gunthorpe
> > > > > wrote:
> > > > > > On Mon, Aug 18, 2025 at 09:25:20AM -0700, Matthew Brost
> > > > > > wrote:
> > > > > > > I think this choice makes sense: it allows embedding the
> > > > > > > wait
> > > > > > > state from
> > > > > > > the initial notifier call into the pass structure. Patch
> > > > > > > [6]
> > > > > > > shows this
> > > > > > > by attaching the issued TLB invalidation fences to the
> > > > > > > pass.
> > > > > > > Since a
> > > > > > > single notifier may be invoked multiple times with
> > > > > > > different
> > > > > > > ranges but
> > > > > > > the same seqno,
> > > > > > 
> > > > > > That should be explained, but also seems to be a bit of a
> > > > > > different
> > > > > > issue..
> > > > > > 
> > > > > > If the design is really to only have two passes and this
> > > > > > linked
> > > > > > list
> > > > > > is about retaining state then there should not be so much
> > > > > > freedom to
> > > > > > have more passes.
> > > > > 
> > > > > I’ll let Thomas weigh in on whether we really need more than
> > > > > two
> > > > > passes;
> > > > > my feeling is that two passes are likely sufficient. It’s
> > > > > also
> > > > > worth
> > > > > noting that the linked list has an added benefit: the
> > > > > notifier
> > > > > tree only
> > > > > needs to be walked once (a small time-complexity win).
> > > > 
> > > > You may end up keeping the linked list just with no way to add
> > > > a
> > > > third
> > > > pass.
> > > 
> > > It seems to me though that linked list still adds unnecessary
> > > complexity. I
> > > think this would all be much easier to follow if we just added
> > > two
> > > new callbacks
> > > - invalidate_start() and invalidate_end() say.
> > 
> > One thing that the linked list avoids, though, is traversing the
> > interval tree two times. It has O(n*log(n)) whereas the linked list
> > overhead is just O(n_2pass).
> > 
> > > 
> > > Admitedly that would still require the linked list (or something
> > > similar) to
> > > retain the ability to hold/pass a context between the start and
> > > end
> > > callbacks.
> > > Which is bit annoying, it's a pity we need to allocate memory in
> > > a
> > > performance
> > > sensitive path to effectively pass (at least in this case) a
> > > single
> > > pointer. I
> > > can't think of any obvious solutions to that though.
> > 
> > One idea is for any two-pass notifier implementation to use a small
> > pool. That would also to some extent mitigate the risk of out-of-
> > memory
> > with GFP_NOWAIT.
> > 
> 
> I think we can attach a preallocated list entry to the driver-side
> notifier state; then you’d only need to allocate (or block) if that
> notifier is invoked more than once while a wait action (e.g., a TLB
> invalidation) is outstanding. Multiple invocations are technically
> possible, but in practice I’d expect them to be rare.
> 
> I’m not sure how much of a win this is, though. On Intel hardware,
> TLB
> invalidations are several orders of magnitude slower than the
> software
> steps our notifiers perform. Ultimately, whether to allocate or
> preallocate is a driver-side choice.

I agree we shouldn't enforce anything at this point. But if we envision
a situation where two-pass notifiers from multiple subsystems subscribe,
the GFP_NOWAIT memory might be exhausted by the notifiers called first,
a greedy behavior that might eventually cause serialization anyway.

So to behave nicely towards other notifier subscriptions, an
implementation should ideally have something pre-allocated.

/Thomas


> 
> Matt
> 
> > /Thomas
> > 
> > 
> > > 
> > > > Jason
> > > > 
> > 


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-08-21  9:34 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-09 13:51 [RFC PATCH 0/6] Multi-pass MMU interval notifiers Thomas Hellström
2025-08-09 13:51 ` [RFC PATCH 1/6] mm/mmu_notifier: Allow multiple struct mmu_interval_notifier passes Thomas Hellström
2025-08-18 16:07   ` Jason Gunthorpe
2025-08-18 16:25     ` Matthew Brost
2025-08-18 16:36       ` Jason Gunthorpe
2025-08-18 16:42         ` Thomas Hellström
2025-08-18 16:45           ` Matthew Brost
2025-08-18 16:44         ` Matthew Brost
2025-08-18 16:46           ` Jason Gunthorpe
2025-08-19  9:55             ` Alistair Popple
2025-08-19 11:33               ` Thomas Hellström
2025-08-19 15:35                 ` Matthew Brost
2025-08-21  9:34                   ` Thomas Hellström
2025-08-19 10:03   ` Alistair Popple
2025-08-19 11:35     ` Thomas Hellström
2025-08-09 13:51 ` [RFC PATCH 2/6] drm/gpusvm: Update GPU SVM / Xe to twopass MMU notifier Thomas Hellström
2025-08-09 13:51 ` [RFC PATCH 3/6] drm/gpusvm: Add drm_gpusvm_in_notifier_* helpers Thomas Hellström
2025-08-09 13:51 ` [RFC PATCH 4/6] drm/xe: Skip waiting on unarmed fences in xe_gt_tlb_invalidation_fence_wait Thomas Hellström
2025-08-09 13:51 ` [RFC PATCH 5/6] drm/xe: Add fences argument to xe_vm_range_tilemask_tlb_invalidation Thomas Hellström
2025-08-09 13:51 ` [RFC PATCH 6/6] drm/xe: Implement two pass MMU notifiers for SVM Thomas Hellström
2025-08-11 20:46   ` Matthew Brost
2025-08-12  9:06     ` Thomas Hellström

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).