[PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings

* [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
@ 2026-06-25 10:59 Yitao Jiang
  2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Yitao Jiang @ 2026-06-25 10:59 UTC (permalink / raw)
  To: Alex Deucher, Christian König, David Airlie, Simona Vetter,
	Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
	linux-kernel, linux-mm, Yitao Jiang

Hi,

This series fixes a THP policy problem I found while debugging
frequent ROCm GPU failures on an AMD Radeon 780M system during ML
training.

Some AMDGPU/KFD user mappings are registered through interval
notifiers and cannot safely tolerate the backing VMA changing from base
pages to a transparent huge page after registration. Userspace can
still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
collapse the range, after the GPU mapping has been registered.
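
For illustration only, the userspace side of that sequence can be
sketched as below. thp_hint_demo() is a hypothetical helper, not one of
the reproducers mentioned later, and the GPU registration step is left
as a comment because it depends on the ROCm userspace libraries.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map an anonymous region and then ask for THP on it.
 *
 * Returns 0 if the hint was accepted, 1 if madvise() refused it (for
 * example on a kernel built without THP), and -1 if mmap() failed.
 */
int thp_hint_demo(size_t len)
{
	void *buf;
	int ret;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* A real workload would register buf with the GPU here (not shown). */

	/* The hint arrives after registration, which is the problematic order. */
	ret = madvise(buf, len, MADV_HUGEPAGE) ? 1 : 0;

	munmap(buf, len);
	return ret;
}
```

Nothing in this sketch touches the GPU; it only shows that the hint is
an ordinary madvise() call that any process can issue at any time.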

On my system this showed up as asynchronous ROCm/HIP kernel launch
failures, often reported later at a synchronization or copy point. I
expect the issue to be relevant to AMDGPU/KFD mappings on
XNACK-disabled GPUs more generally, because those mappings cannot rely
on replayable GPU faults after a CPU-side THP remap. I have validated
the failure and fix on AMD Radeon 780M / gfx1103.

Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
users can ask the MM core to keep the covered VMA range out of THP
while the notifier is active. The MM core applies VM_NOHUGEPAGE and
clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
over an active opt-in range is treated as an ignored hint, and
MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
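
A small userspace model of that flag handling, as a sketch rather than
the kernel code: the MODEL_* bits and both function names are
hypothetical stand-ins for VM_HUGEPAGE, VM_NOHUGEPAGE and the logic in
patch 1.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's VM_HUGEPAGE / VM_NOHUGEPAGE bits. */
#define MODEL_VM_HUGEPAGE   (1u << 0)
#define MODEL_VM_NOHUGEPAGE (1u << 1)

/* Opt-in registration: force the nohuge policy and drop any huge hint. */
unsigned int model_block_thp(unsigned int vm_flags)
{
	vm_flags |= MODEL_VM_NOHUGEPAGE;
	vm_flags &= ~MODEL_VM_HUGEPAGE;
	return vm_flags;
}

/*
 * MADV_HUGEPAGE: left as an ignored hint while an opt-in notifier covers
 * the range, otherwise it switches the VMA to the huge-page policy.
 */
unsigned int model_madv_hugepage(unsigned int vm_flags, bool range_blocked)
{
	if ((vm_flags & MODEL_VM_NOHUGEPAGE) && range_blocked)
		return vm_flags;

	vm_flags &= ~MODEL_VM_NOHUGEPAGE;
	vm_flags |= MODEL_VM_HUGEPAGE;
	return vm_flags;
}
```

In this model a VMA that started with the huge hint ends up nohuge-only
after registration, and a later MADV_HUGEPAGE leaves it unchanged for
as long as range_blocked is true.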

Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
current behavior.

This does not disable THP globally and does not add work to GPU
command submission or kernel launch paths. Additional work is limited
to opt-in notifier registration, opt-in notifier flag transitions, and
MADV_HUGEPAGE attempts that overlap an active opt-in range.

I tested this on top of torvalds/linux commit ab9de95c9cf9 with:

  - scripts/checkpatch.pl --strict --no-tree
  - git apply --check
  - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
    DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
  - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
    originally exposed the failure on my Radeon 780M system

The standalone reproducers depend on ROCm userspace libraries, so I
have not included them in this series. I can send them separately if
useful.

This series was prepared with assistance from OpenAI Codex (GPT-5.5).
I reviewed the resulting code and take responsibility for the
submission.

Yitao Jiang (3):
  mm/mmu_notifier: let interval notifiers block THP
  drm/amdgpu: block THP for HSA userptr notifiers
  drm/amdkfd: block THP for non-replayable SVM ranges

 drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
 include/linux/huge_mm.h                 |   5 +-
 include/linux/mmu_notifier.h            |  28 ++++
 mm/khugepaged.c                         |   9 +-
 mm/madvise.c                            |   3 +-
 mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
 7 files changed, 286 insertions(+), 24 deletions(-)

base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
-- 
2.53.0

* [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP
  2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
@ 2026-06-25 10:59 ` Yitao Jiang
  2026-06-25 11:50   ` David Hildenbrand (Arm)
  2026-06-25 11:58   ` Lorenzo Stoakes
  2026-06-25 10:59 ` [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers Yitao Jiang
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 14+ messages in thread
From: Yitao Jiang @ 2026-06-25 10:59 UTC (permalink / raw)
  To: Alex Deucher, Christian König, David Airlie, Simona Vetter,
	Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
	linux-kernel, linux-mm, Yitao Jiang

Some secondary MMUs cannot safely tolerate a user VMA becoming backed
by transparent huge pages after the range has been registered with an
interval notifier. Drivers can observe the page-table layout change
through invalidations, but devices without replayable faults, or ranges
that must stay mapped, cannot necessarily re-establish coherent device
mappings before later device access.

Add MMU_INTERVAL_NOTIFIER_BLOCK_THP so a driver can declare this
property when registering an interval notifier. The MM core then marks
the covered VMA range VM_NOHUGEPAGE and clears VM_HUGEPAGE while
holding mmap_lock for write. A later MADV_HUGEPAGE on the same active
range is treated as an ignored hint, leaving the MM-owned nohuge
policy intact. MADV_COLLAPSE already rejects VM_NOHUGEPAGE VMAs.

This keeps the policy in MM code instead of requiring device drivers
to edit VMA THP flags directly, and it only affects opt-in notifier
ranges at registration or flag-transition time.
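
The "same active range" test is an interval-overlap lookup. A
linear-scan model of it, outside the kernel, might look like the sketch
below. struct model_interval and model_range_block_thp() are
illustrative names only; the patch itself walks the notifier interval
tree under the subscriptions lock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of one registered interval notifier. */
struct model_interval {
	unsigned long start;	/* first byte covered */
	unsigned long last;	/* last byte covered, inclusive */
	bool block_thp;		/* models MMU_INTERVAL_NOTIFIER_BLOCK_THP */
};

/*
 * Return true if any opt-in interval overlaps the half-open range
 * [start, end). Intervals store an inclusive last byte, so the query is
 * converted to [start, end - 1] before comparing.
 */
bool model_range_block_thp(const struct model_interval *subs, size_t n,
			   unsigned long start, unsigned long end)
{
	size_t i;

	if (start >= end)
		return false;

	for (i = 0; i < n; i++) {
		if (subs[i].start <= end - 1 && subs[i].last >= start &&
		    subs[i].block_thp)
			return true;
	}

	return false;
}
```

An empty query range never matches, and an interval that merely touches
the query at its exclusive end does not count as overlapping.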

Assisted-by: OpenAI-Codex:GPT-5.5
Signed-off-by: Yitao Jiang <jytscientist@hotmail.com>
---
 include/linux/huge_mm.h      |   5 +-
 include/linux/mmu_notifier.h |  28 +++++
 mm/khugepaged.c              |   9 +-
 mm/madvise.c                 |   3 +-
 mm/mmu_notifier.c            | 204 +++++++++++++++++++++++++++++++++--
 5 files changed, 237 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad20f7f8c..3dae515ff 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -489,8 +489,8 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			__split_huge_pud(__vma, __pud, __address);	\
 	}  while (0)
 
-int hugepage_madvise(struct vm_area_struct *vma, vm_flags_t *vm_flags,
-		     int advice);
+int hugepage_madvise(struct vm_area_struct *vma, unsigned long start,
+		     unsigned long end, vm_flags_t *vm_flags, int advice);
 int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 		     unsigned long end, bool *lock_dropped);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
@@ -694,6 +694,7 @@ static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
 	do { } while (0)
 
 static inline int hugepage_madvise(struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end,
 				   vm_flags_t *vm_flags, int advice)
 {
 	return -EINVAL;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index a11a44eef..4accfb65f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -293,8 +293,16 @@ struct mmu_interval_notifier {
 	struct mm_struct *mm;
 	struct hlist_node deferred_item;
 	unsigned long invalidate_seq;
+	unsigned int flags;
 };
 
+/*
+ * The interval range cannot safely be backed by transparent huge pages while
+ * the notifier is active. The MM core owns the VMA policy change so drivers
+ * do not have to manipulate VM_HUGEPAGE/VM_NOHUGEPAGE directly.
+ */
+#define MMU_INTERVAL_NOTIFIER_BLOCK_THP BIT(0)
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 #ifdef CONFIG_LOCKDEP
@@ -347,7 +355,20 @@ int mmu_interval_notifier_insert_locked(
 	struct mmu_interval_notifier *interval_sub, struct mm_struct *mm,
 	unsigned long start, unsigned long length,
 	const struct mmu_interval_notifier_ops *ops);
+int
+mmu_interval_notifier_insert_locked_flags(struct mmu_interval_notifier *interval_sub,
+					  struct mm_struct *mm,
+					  unsigned long start,
+					  unsigned long length,
+					  const struct mmu_interval_notifier_ops *ops,
+					  unsigned int flags);
+int
+mmu_interval_notifier_set_flags_locked(struct mmu_interval_notifier *interval_sub,
+				       unsigned int flags);
 void mmu_interval_notifier_remove(struct mmu_interval_notifier *interval_sub);
+bool mmu_interval_notifier_range_block_thp(struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long end);
 
 /**
  * mmu_interval_set_seq - Save the invalidation sequence
@@ -637,6 +658,13 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
 {
 }
 
+static inline bool mmu_interval_notifier_range_block_thp(struct mm_struct *mm,
+							 unsigned long start,
+							 unsigned long end)
+{
+	return false;
+}
+
 #define mmu_notifier_range_update_to_read_only(r) false
 
 static inline void mmu_notifier_synchronize(void)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 617bca76d..a9b05e716 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -445,11 +445,16 @@ static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
 	return khugepaged_max_ptes_swap;
 }
 
-int hugepage_madvise(struct vm_area_struct *vma,
-		     vm_flags_t *vm_flags, int advice)
+int hugepage_madvise(struct vm_area_struct *vma, unsigned long start,
+		     unsigned long end, vm_flags_t *vm_flags, int advice)
 {
 	switch (advice) {
 	case MADV_HUGEPAGE:
+		if ((*vm_flags & VM_NOHUGEPAGE) &&
+		    mmu_interval_notifier_range_block_thp(vma->vm_mm,
+							  start, end))
+			return 0;
+
 		*vm_flags &= ~VM_NOHUGEPAGE;
 		*vm_flags |= VM_HUGEPAGE;
 		/*
diff --git a/mm/madvise.c b/mm/madvise.c
index cd9bb0770..c7cee4fcf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1416,7 +1416,8 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
 		break;
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
-		error = hugepage_madvise(vma, &new_flags, behavior);
+		error = hugepage_madvise(vma, range->start, range->end,
+					 &new_flags, behavior);
 		if (error)
 			goto out;
 		break;
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 245b74f39..852a5682b 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -581,6 +581,49 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	return 0;
 }
 
+/**
+ * mmu_interval_notifier_range_block_thp - check if a range must not use THP
+ * @mm: mm_struct to check
+ * @start: start address
+ * @end: end address
+ *
+ * Return true if an active interval notifier covering the range requested
+ * MMU_INTERVAL_NOTIFIER_BLOCK_THP.
+ */
+bool mmu_interval_notifier_range_block_thp(struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long end)
+{
+	struct mmu_notifier_subscriptions *subscriptions;
+	struct mmu_interval_notifier *interval_sub;
+	struct interval_tree_node *node;
+	bool block_thp = false;
+
+	if (start >= end)
+		return false;
+
+	/* Pairs with the store in mmu_notifier_register(). */
+	subscriptions = smp_load_acquire(&mm->notifier_subscriptions);
+	if (!subscriptions || !subscriptions->has_itree)
+		return false;
+
+	spin_lock(&subscriptions->lock);
+	for (node = interval_tree_iter_first(&subscriptions->itree, start,
+					     end - 1);
+	     node;
+	     node = interval_tree_iter_next(node, start, end - 1)) {
+		interval_sub = container_of(node, struct mmu_interval_notifier,
+					    interval_tree);
+		if (interval_sub->flags & MMU_INTERVAL_NOTIFIER_BLOCK_THP) {
+			block_thp = true;
+			break;
+		}
+	}
+	spin_unlock(&subscriptions->lock);
+
+	return block_thp;
+}
+
 static void
 mn_hlist_invalidate_end(struct mmu_notifier_subscriptions *subscriptions,
 			struct mmu_notifier_range *range)
@@ -933,13 +976,69 @@ void mmu_notifier_put(struct mmu_notifier *subscription)
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_put);
 
+#define MMU_INTERVAL_NOTIFIER_KNOWN_FLAGS \
+	(MMU_INTERVAL_NOTIFIER_BLOCK_THP)
+
+static int mmu_interval_notifier_check_flags(unsigned int flags)
+{
+	if (flags & ~MMU_INTERVAL_NOTIFIER_KNOWN_FLAGS)
+		return -EINVAL;
+	return 0;
+}
+
+static int
+mmu_interval_notifier_block_thp_locked(struct mm_struct *mm,
+				       unsigned long start,
+				       unsigned long end)
+{
+	struct vm_area_struct *vma, *prev;
+	struct vma_iterator vmi;
+
+	mmap_assert_write_locked(mm);
+
+	vma_iter_init(&vmi, mm, start);
+	vma = vma_iter_load(&vmi);
+	prev = vma_prev(&vmi);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	for_each_vma_range(vmi, vma, end) {
+		const unsigned long curr_start = max(vma->vm_start, start);
+		const unsigned long curr_end = min(vma->vm_end, end);
+		vma_flags_t new_flags;
+
+		if (vma->vm_flags & VM_NO_KHUGEPAGED)
+			goto next;
+
+		new_flags = vma->flags;
+		vma_flags_set(&new_flags, VMA_NOHUGEPAGE_BIT);
+		vma_flags_clear(&new_flags, VMA_HUGEPAGE_BIT);
+		if (vma_flags_same_pair(&new_flags, &vma->flags))
+			goto next;
+
+		vma = vma_modify_flags(&vmi, prev, vma, curr_start,
+				       curr_end, &new_flags);
+		if (IS_ERR(vma))
+			return PTR_ERR(vma);
+
+		vma_start_write(vma);
+		vma->flags = new_flags;
+next:
+		prev = vma;
+	}
+
+	return 0;
+}
+
 static int __mmu_interval_notifier_insert(
 	struct mmu_interval_notifier *interval_sub, struct mm_struct *mm,
 	struct mmu_notifier_subscriptions *subscriptions, unsigned long start,
-	unsigned long length, const struct mmu_interval_notifier_ops *ops)
+	unsigned long length, const struct mmu_interval_notifier_ops *ops,
+	unsigned int flags)
 {
 	interval_sub->mm = mm;
 	interval_sub->ops = ops;
+	interval_sub->flags = flags;
 	RB_CLEAR_NODE(&interval_sub->interval_tree.rb);
 	interval_sub->interval_tree.start = start;
 	/*
@@ -1034,32 +1133,123 @@ int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
 		subscriptions = mm->notifier_subscriptions;
 	}
 	return __mmu_interval_notifier_insert(interval_sub, mm, subscriptions,
-					      start, length, ops);
+					      start, length, ops, 0);
 }
 EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert);
 
-int mmu_interval_notifier_insert_locked(
-	struct mmu_interval_notifier *interval_sub, struct mm_struct *mm,
-	unsigned long start, unsigned long length,
-	const struct mmu_interval_notifier_ops *ops)
+/**
+ * mmu_interval_notifier_insert_locked_flags - Insert an interval notifier
+ * @interval_sub: Interval subscription to register
+ * @mm: mm_struct to attach to
+ * @start: Starting virtual address to monitor
+ * @length: Length of the range to monitor
+ * @ops: Interval notifier operations to be called on matching events
+ * @flags: MMU_INTERVAL_NOTIFIER_* flags
+ *
+ * Like mmu_interval_notifier_insert_locked(), but lets callers request
+ * additional MM-owned policy for the interval while holding mmap_lock for
+ * write.
+ */
+int
+mmu_interval_notifier_insert_locked_flags(struct mmu_interval_notifier *interval_sub,
+					  struct mm_struct *mm,
+					  unsigned long start,
+					  unsigned long length,
+					  const struct mmu_interval_notifier_ops *ops,
+					  unsigned int flags)
 {
 	struct mmu_notifier_subscriptions *subscriptions =
 		mm->notifier_subscriptions;
+	unsigned long end;
 	int ret;
 
 	mmap_assert_write_locked(mm);
 
+	ret = mmu_interval_notifier_check_flags(flags);
+	if (ret)
+		return ret;
+
+	if (flags & MMU_INTERVAL_NOTIFIER_BLOCK_THP) {
+		if (length == 0 || check_add_overflow(start, length, &end))
+			return -EOVERFLOW;
+	}
+
 	if (!subscriptions || !subscriptions->has_itree) {
 		ret = __mmu_notifier_register(NULL, mm);
 		if (ret)
 			return ret;
 		subscriptions = mm->notifier_subscriptions;
 	}
+
+	if (flags & MMU_INTERVAL_NOTIFIER_BLOCK_THP) {
+		ret = mmu_interval_notifier_block_thp_locked(mm, start, end);
+		if (ret)
+			return ret;
+	}
+
 	return __mmu_interval_notifier_insert(interval_sub, mm, subscriptions,
-					      start, length, ops);
+					      start, length, ops, flags);
+}
+EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked_flags);
+
+int mmu_interval_notifier_insert_locked(struct mmu_interval_notifier *interval_sub,
+					struct mm_struct *mm,
+					unsigned long start,
+					unsigned long length,
+					const struct mmu_interval_notifier_ops *ops)
+{
+	return mmu_interval_notifier_insert_locked_flags(interval_sub, mm,
+							 start, length,
+							 ops, 0);
 }
 EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked);
 
+/**
+ * mmu_interval_notifier_set_flags_locked - update an interval notifier's flags
+ * @interval_sub: Interval subscription to update
+ * @flags: MMU_INTERVAL_NOTIFIER_* flags
+ *
+ * Update MMU interval notifier flags while holding mmap_lock for write. When
+ * enabling MMU_INTERVAL_NOTIFIER_BLOCK_THP, the MM core first updates the VMA
+ * THP policy for the notifier's address range.
+ */
+int
+mmu_interval_notifier_set_flags_locked(struct mmu_interval_notifier *interval_sub,
+				       unsigned int flags)
+{
+	struct mm_struct *mm = interval_sub->mm;
+	unsigned long start = interval_sub->interval_tree.start;
+	unsigned long end;
+	int ret;
+
+	ret = mmu_interval_notifier_check_flags(flags);
+	if (ret)
+		return ret;
+
+	if (WARN_ON_ONCE(!mm))
+		return -EINVAL;
+
+	mmap_assert_write_locked(mm);
+
+	if ((flags & MMU_INTERVAL_NOTIFIER_BLOCK_THP) &&
+	    !(interval_sub->flags & MMU_INTERVAL_NOTIFIER_BLOCK_THP)) {
+		if (interval_sub->interval_tree.last == ULONG_MAX)
+			return -EOVERFLOW;
+		end = interval_sub->interval_tree.last + 1;
+
+		ret = mmu_interval_notifier_block_thp_locked(mm, start, end);
+		if (ret)
+			return ret;
+	}
+
+	spin_lock(&mm->notifier_subscriptions->lock);
+	interval_sub->flags = flags;
+	spin_unlock(&mm->notifier_subscriptions->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(mmu_interval_notifier_set_flags_locked);
+
 static bool
 mmu_interval_seq_released(struct mmu_notifier_subscriptions *subscriptions,
 			  unsigned long seq)
-- 
2.53.0



* Re: [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP
  2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
@ 2026-06-25 11:50   ` David Hildenbrand (Arm)
  2026-06-25 11:58   ` Lorenzo Stoakes
  1 sibling, 0 replies; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-25 11:50 UTC (permalink / raw)
  To: Yitao Jiang, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Felix Kuehling, Andrew Morton, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
	linux-kernel, linux-mm

On 6/25/26 12:59, Yitao Jiang wrote:
> Some secondary MMUs cannot safely tolerate a user VMA becoming backed
> by transparent huge pages after the range has been registered with an
> interval notifier. Drivers can observe the page-table layout change
> through invalidations, but devices without replayable faults, or ranges
> that must stay mapped,

Then you shouldn't be using MMU notifiers.

Use good old nasty page pinning. :)

-- 
Cheers,

David


* Re: [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP
  2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
  2026-06-25 11:50   ` David Hildenbrand (Arm)
@ 2026-06-25 11:58   ` Lorenzo Stoakes
  1 sibling, 0 replies; 14+ messages in thread
From: Lorenzo Stoakes @ 2026-06-25 11:58 UTC (permalink / raw)
  To: Yitao Jiang
  Cc: Alex Deucher, Christian König, David Airlie, Simona Vetter,
	Felix Kuehling, Andrew Morton, David Hildenbrand, Zi Yan,
	Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
	linux-kernel, linux-mm

On Thu, Jun 25, 2026 at 06:59:51PM +0800, Yitao Jiang wrote:
> Assisted-by: OpenAI-Codex:GPT-5.5

Thanks for acking AI involvement, that's appreciated (and there appears to be a
fair bit of unacknowledged AI-generated code being submitted at the moment).

However, may I gently direct you towards the last few paragraphs in this document:

https://origin.kernel.org/doc/html/latest/process/generated-content.html

I think this MAY be a case of the AI possibly misleading you into a crazy idea
when, as David points out, page pinning is what you need :)

We don't bite, if you have a problem that you need to solve, feel free to email
linux-mm@kvack.org and relevant maintainers/reviewers from MAINTAINERS with a
'[DISCUSSION]' or '[QUESTION]'-prefixed thread and we can help you ahead of
time.

Thanks, Lorenzo

* [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers
  2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
  2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
@ 2026-06-25 10:59 ` Yitao Jiang
  2026-06-25 12:36   ` Christian König
  2026-06-25 10:59 ` [PATCH 3/3] drm/amdkfd: block THP for non-replayable SVM ranges Yitao Jiang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Yitao Jiang @ 2026-06-25 10:59 UTC (permalink / raw)
  To: Alex Deucher, Christian König, David Airlie, Simona Vetter,
	Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
	linux-kernel, linux-mm, Yitao Jiang

HSA userptr buffer objects are used by KFD compute queues. On systems
where the GPU cannot reliably tolerate a CPU THP remap of an active
userptr range, allowing khugepaged or MADV_COLLAPSE to replace PTE
mappings with a PMD mapping can leave later GPU work failing
asynchronously.

Register HSA userptr interval notifiers with
MMU_INTERVAL_NOTIFIER_BLOCK_THP. GFX userptrs keep the existing
notifier path and do not opt in.

Assisted-by: OpenAI-Codex:GPT-5.5
Signed-off-by: Yitao Jiang <jytscientist@hotmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
index 99bc9ad67..c0b36164c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
@@ -44,6 +44,7 @@
  */
 
 #include <linux/firmware.h>
+#include <linux/mm.h>
 #include <linux/module.h>
 #include <drm/drm.h>
 
@@ -130,16 +131,24 @@ static const struct mmu_interval_notifier_ops amdgpu_hmm_hsa_ops = {
  */
 int amdgpu_hmm_register(struct amdgpu_bo *bo, unsigned long addr)
 {
+	struct mm_struct *mm = current->mm;
+	unsigned long size = amdgpu_bo_size(bo);
 	int r;
 
-	if (bo->kfd_bo)
-		r = mmu_interval_notifier_insert(&bo->notifier, current->mm,
-						    addr, amdgpu_bo_size(bo),
-						    &amdgpu_hmm_hsa_ops);
-	else
-		r = mmu_interval_notifier_insert(&bo->notifier, current->mm, addr,
-							amdgpu_bo_size(bo),
-							&amdgpu_hmm_gfx_ops);
+	if (unlikely(!mm))
+		return -ESRCH;
+
+	if (bo->kfd_bo) {
+		mmap_write_lock(mm);
+		r = mmu_interval_notifier_insert_locked_flags(&bo->notifier, mm,
+							      addr, size,
+							      &amdgpu_hmm_hsa_ops,
+							      MMU_INTERVAL_NOTIFIER_BLOCK_THP);
+		mmap_write_unlock(mm);
+	} else {
+		r = mmu_interval_notifier_insert(&bo->notifier, mm, addr, size,
+						 &amdgpu_hmm_gfx_ops);
+	}
 	if (r)
 		/*
 		 * Make sure amdgpu_hmm_unregister() doesn't call
-- 
2.53.0



* Re: [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers
  2026-06-25 10:59 ` [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers Yitao Jiang
@ 2026-06-25 12:36   ` Christian König
  0 siblings, 0 replies; 14+ messages in thread
From: Christian König @ 2026-06-25 12:36 UTC (permalink / raw)
  To: Yitao Jiang, Alex Deucher, David Airlie, Simona Vetter,
	Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
	linux-kernel, linux-mm

On 6/25/26 12:59, Yitao Jiang wrote:
> HSA userptr buffer objects are used by KFD compute queues. On systems
> where the GPU cannot reliably tolerate a CPU THP remap of an active
> userptr range, allowing khugepaged or MADV_COLLAPSE to replace PTE
> mappings with a PMD mapping can leave later GPU work failing
> asynchronously.

Absolutely clear NAK to this.

That largely sounds like it just works around some issue and is not really a doable fix.

Regards,
Christian.

> 
> Register HSA userptr interval notifiers with
> MMU_INTERVAL_NOTIFIER_BLOCK_THP. GFX userptrs keep the existing
> notifier path and do not opt in.
> 
> Assisted-by: OpenAI-Codex:GPT-5.5
> Signed-off-by: Yitao Jiang <jytscientist@hotmail.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 25 +++++++++++++++++--------
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> index 99bc9ad67..c0b36164c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> @@ -44,6 +44,7 @@
>   */
> 
>  #include <linux/firmware.h>
> +#include <linux/mm.h>
>  #include <linux/module.h>
>  #include <drm/drm.h>
> 
> @@ -130,16 +131,24 @@ static const struct mmu_interval_notifier_ops amdgpu_hmm_hsa_ops = {
>   */
>  int amdgpu_hmm_register(struct amdgpu_bo *bo, unsigned long addr)
>  {
> +       struct mm_struct *mm = current->mm;
> +       unsigned long size = amdgpu_bo_size(bo);
>         int r;
> 
> -       if (bo->kfd_bo)
> -               r = mmu_interval_notifier_insert(&bo->notifier, current->mm,
> -                                                   addr, amdgpu_bo_size(bo),
> -                                                   &amdgpu_hmm_hsa_ops);
> -       else
> -               r = mmu_interval_notifier_insert(&bo->notifier, current->mm, addr,
> -                                                       amdgpu_bo_size(bo),
> -                                                       &amdgpu_hmm_gfx_ops);
> +       if (unlikely(!mm))
> +               return -ESRCH;
> +
> +       if (bo->kfd_bo) {
> +               mmap_write_lock(mm);
> +               r = mmu_interval_notifier_insert_locked_flags(&bo->notifier, mm,
> +                                                             addr, size,
> +                                                             &amdgpu_hmm_hsa_ops,
> +                                                             MMU_INTERVAL_NOTIFIER_BLOCK_THP);
> +               mmap_write_unlock(mm);
> +       } else {
> +               r = mmu_interval_notifier_insert(&bo->notifier, mm, addr, size,
> +                                                &amdgpu_hmm_gfx_ops);
> +       }
>         if (r)
>                 /*
>                  * Make sure amdgpu_hmm_unregister() doesn't call
> --
> 2.53.0
> 



* [PATCH 3/3] drm/amdkfd: block THP for non-replayable SVM ranges
  2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
  2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
  2026-06-25 10:59 ` [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers Yitao Jiang
@ 2026-06-25 10:59 ` Yitao Jiang
  2026-06-25 11:47 ` [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings David Hildenbrand (Arm)
  2026-06-25 12:35 ` Christian König
  4 siblings, 0 replies; 14+ messages in thread
From: Yitao Jiang @ 2026-06-25 10:59 UTC (permalink / raw)
  To: Alex Deucher, Christian König, David Airlie, Simona Vetter,
	Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
	linux-kernel, linux-mm, Yitao Jiang

KFD SVM ranges on processes without XNACK, and ranges requested as
GPU_ALWAYS_MAPPED, cannot rely on replayable GPU faults after a CPU THP
remap of the registered VA range. Keep those ranges backed by base
pages while their interval notifier is active.

Opt those SVM interval notifiers into MMU_INTERVAL_NOTIFIER_BLOCK_THP
and update the flag when SVM attributes change. XNACK-enabled ranges
that can handle remaps through replayable faults remain eligible for
THP unless GPU_ALWAYS_MAPPED is requested.
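
The opt-in rule reduces to a two-input decision. As a standalone sketch
(the MODEL_* constants and model_svm_mn_flags() are hypothetical
stand-ins for KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED,
MMU_INTERVAL_NOTIFIER_BLOCK_THP and svm_range_mn_flags() in the diff
below):

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the KFD range flag and the notifier flag. */
#define MODEL_SVM_FLAG_GPU_ALWAYS_MAPPED (1u << 0)
#define MODEL_NOTIFIER_BLOCK_THP         (1u << 0)

/*
 * A range opts in to THP blocking when the process runs without XNACK,
 * or when the range itself was requested as always mapped.
 */
unsigned int model_svm_mn_flags(bool xnack_enabled, unsigned int range_flags)
{
	if (!xnack_enabled ||
	    (range_flags & MODEL_SVM_FLAG_GPU_ALWAYS_MAPPED))
		return MODEL_NOTIFIER_BLOCK_THP;

	return 0;
}
```

Only the XNACK-enabled, not-always-mapped combination stays eligible
for THP in this model.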

Assisted-by: OpenAI-Codex:GPT-5.5
Signed-off-by: Yitao Jiang <jytscientist@hotmail.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 36 ++++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 3841943da..0d0feba7b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -22,6 +22,7 @@
  */
 
 #include <linux/types.h>
+#include <linux/mm.h>
 #include <linux/sched/task.h>
 #include <linux/dynamic_debug.h>
 #include <drm/ttm/ttm_tt.h>
@@ -81,6 +82,26 @@ static const struct mmu_interval_notifier_ops svm_range_mn_ops = {
 	.invalidate = svm_range_cpu_invalidate_pagetables,
 };
 
+static unsigned int
+svm_range_mn_flags(struct svm_range *prange)
+{
+	struct kfd_process *p = container_of(prange->svms, struct kfd_process,
+					     svms);
+
+	if (!p->xnack_enabled ||
+	    (prange->flags & KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED))
+		return MMU_INTERVAL_NOTIFIER_BLOCK_THP;
+
+	return 0;
+}
+
+static int
+svm_range_update_mn_flags_locked(struct svm_range *prange)
+{
+	return mmu_interval_notifier_set_flags_locked(&prange->notifier,
+						      svm_range_mn_flags(prange));
+}
+
 /**
  * svm_range_unlink - unlink svm_range from lists and interval tree
  * @prange: svm range structure to be removed
@@ -112,10 +133,11 @@ svm_range_add_notifier_locked(struct mm_struct *mm, struct svm_range *prange)
 	pr_debug("svms 0x%p prange 0x%p [0x%lx 0x%lx]\n", prange->svms,
 		 prange, prange->start, prange->last);
 
-	mmu_interval_notifier_insert_locked(&prange->notifier, mm,
-				     prange->start << PAGE_SHIFT,
-				     prange->npages << PAGE_SHIFT,
-				     &svm_range_mn_ops);
+	mmu_interval_notifier_insert_locked_flags(&prange->notifier, mm,
+						  prange->start << PAGE_SHIFT,
+						  prange->npages << PAGE_SHIFT,
+						  &svm_range_mn_ops,
+						  svm_range_mn_flags(prange));
 }
 
 /**
@@ -3763,6 +3785,12 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
 	}
 	list_for_each_entry(prange, &update_list, update_list) {
 		svm_range_apply_attrs(p, prange, nattr, attrs, &update_mapping);
+		r = svm_range_update_mn_flags_locked(prange);
+		if (r) {
+			mutex_unlock(&svms->lock);
+			mmap_write_unlock(mm);
+			goto out;
+		}
 		/* TODO: unmap ranges from GPU that lost access */
 	}
 	update_mapping |= !p->xnack_enabled && !list_empty(&remap_list);
-- 
2.53.0



* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
  2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
                   ` (2 preceding siblings ...)
  2026-06-25 10:59 ` [PATCH 3/3] drm/amdkfd: block THP for non-replayable SVM ranges Yitao Jiang
@ 2026-06-25 11:47 ` David Hildenbrand (Arm)
  2026-06-25 11:54   ` Lorenzo Stoakes
  2026-06-25 12:35 ` Christian König
  4 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-25 11:47 UTC (permalink / raw)
  To: Yitao Jiang, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Felix Kuehling, Andrew Morton, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
	linux-kernel, linux-mm

On 6/25/26 12:59, Yitao Jiang wrote:
> Hi,
> 
> This series fixes a THP policy problem I found while debugging
> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> training.
> 
> Some AMDGPU/KFD user mappings are registered through interval
> notifiers and cannot safely tolerate the backing VMA changing from base
> pages to a transparent huge page after registration. Userspace can
> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> collapse the range, after the GPU mapping has been registered.

Huh, why? As a memory notifier user, you must be prepared for memory to get
unmapped+remapped at random points in time.

What is the precise problem here? How are you handling THPs at registration time?

Letting arbitrary drivers make THP policies sounds like the very wrong approach.
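
The contract described above can be illustrated with a toy userspace model.
This is only a sketch under stated assumptions: it is not kernel code, every
toy_* name is hypothetical, and it loosely mirrors the sequence-count retry
pattern that the kernel exposes as mmu_interval_read_begin() and
mmu_interval_read_retry(). The point is that the mapper never assumes the
backing page stays put; it samples a sequence number, looks up the page, and
starts over if an invalidation arrived in between.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all names are hypothetical and are not kernel API. */
struct toy_notifier {
	unsigned long seq;	/* bumped on every invalidation */
};

struct toy_mapping {
	unsigned long cpu_page;	/* page the "CPU page table" points at */
	unsigned long dev_page;	/* page the "device page table" points at */
	bool dev_valid;		/* device mapping currently usable? */
};

/* MM side: the backing page changes (THP collapse, migration, ...). */
void toy_invalidate(struct toy_notifier *n, struct toy_mapping *m,
		    unsigned long new_page)
{
	n->seq++;		/* tell concurrent mappers to retry */
	m->dev_valid = false;	/* device must stop using the old page */
	m->cpu_page = new_page;
}

/* Helper that simulates an invalidation racing with toy_map(). */
void toy_race_collapse(struct toy_notifier *n, struct toy_mapping *m)
{
	toy_invalidate(n, m, 512);
}

/*
 * Driver side: (re)establish the device mapping and return the number of
 * attempts it took. The optional race callback fires once, between the
 * lookup and the commit, to simulate a concurrent invalidation. A real
 * driver would do the final check and commit under a lock that its
 * invalidate callback also takes.
 */
int toy_map(struct toy_notifier *n, struct toy_mapping *m,
	    void (*race)(struct toy_notifier *, struct toy_mapping *))
{
	int attempts = 0;

	for (;;) {
		unsigned long seq = n->seq;		/* sample sequence */
		unsigned long page = m->cpu_page;	/* look up backing page */

		attempts++;
		if (race) {
			race(n, m);
			race = NULL;
		}
		if (seq != n->seq)
			continue;	/* invalidated meanwhile: start over */

		m->dev_page = page;
		m->dev_valid = true;
		return attempts;
	}
}
```

Calling toy_map() with toy_race_collapse as the callback takes two attempts
and ends with the device mapping pointing at the new page, which is the
behavior a notifier user is expected to implement rather than prevent.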

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
  2026-06-25 11:47 ` [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings David Hildenbrand (Arm)
@ 2026-06-25 11:54   ` Lorenzo Stoakes
  2026-06-25 12:14     ` Re: " 蒋 亦韬
  0 siblings, 1 reply; 14+ messages in thread
From: Lorenzo Stoakes @ 2026-06-25 11:54 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Yitao Jiang, Alex Deucher, Christian König, David Airlie,
	Simona Vetter, Felix Kuehling, Andrew Morton, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Jann Horn, amd-gfx, dri-devel, linux-kernel,
	linux-mm

NAK to this or any version of this.

This series is insane and the idea is insane.

On Thu, Jun 25, 2026 at 01:47:25PM +0200, David Hildenbrand (Arm) wrote:
> On 6/25/26 12:59, Yitao Jiang wrote:
> > Hi,
> >
> > This series fixes a THP policy problem I found while debugging
> > frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> > training.
> >
> > Some AMDGPU/KFD user mappings are registered through interval
> > notifiers and cannot safely tolerate the backing VMA changing from base
> > pages to a transparent huge page after registration. Userspace can
> > still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> > collapse the range, after the GPU mapping has been registered.
>
> Huh, why? As a memory notifier user, you must be prepared for memory to get
> unmapped+remapped at random points in time.
>
> What is the precise problem here? How are you handling THPs at registration time?
>
> Letting arbitrary drivers make THP policies sounds like the very wrong approach.

We absolutely will not _ever_ allow drivers to do this while I still breathe :)

>
> --
> Cheers,
>
> David

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
  2026-06-25 11:54   ` Lorenzo Stoakes
@ 2026-06-25 12:14     ` 蒋 亦韬
  0 siblings, 0 replies; 14+ messages in thread
From: 蒋 亦韬 @ 2026-06-25 12:14 UTC (permalink / raw)
  To: Lorenzo Stoakes, David Hildenbrand (Arm)
  Cc: Alex Deucher, Christian König, David Airlie, Simona Vetter,
	Felix Kuehling, Andrew Morton, Zi Yan, Baolin Wang,
	Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
	Michal Hocko, Jann Horn, amd-gfx@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org

Hi David, Lorenzo,

Thank you for the patient and direct feedback.

You are right; I misjudged the scope and abstraction here. My initial local fix was in the AMD driver path and addressed the failure I was seeing there. I then tried to move the solution into MM core because I guessed similar notifier users might hit the same class of problem. David's explanation makes clear that this was the wrong model: an MMU notifier user must tolerate unmap/remap, and mappings that cannot tolerate that need a different mechanism, such as page pinning, not a driver-controlled THP policy in MM core.

Sorry for the noise and for taking reviewer time. I appreciate the explanation, since it corrected my understanding of the expected MMU notifier and THP semantics.

On the AI assistance: I disclosed it because it was involved, and I did review the generated code against the behavior I thought I wanted. The failure here was my own misunderstanding of the MM core contract, which led to an inappropriate patch despite that review.

I will drop this series and will not send a v2 for this approach. I will re-scope the work to the AMDGPU/KFD side, with a minimal reproducer and a discussion/question first if MM input is needed, rather than proposing MM core changes.

Thanks again,
Yitao

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
  2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
                   ` (3 preceding siblings ...)
  2026-06-25 11:47 ` [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings David Hildenbrand (Arm)
@ 2026-06-25 12:35 ` Christian König
  2026-06-25 13:01   ` Re: " 蒋 亦韬
  4 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2026-06-25 12:35 UTC (permalink / raw)
  To: Yitao Jiang, Alex Deucher, David Airlie, Simona Vetter,
	Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
	linux-kernel, linux-mm

On 6/25/26 12:59, Yitao Jiang wrote:
> Hi,
> 
> This series fixes a THP policy problem I found while debugging
> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> training.
> 
> Some AMDGPU/KFD user mappings are registered through interval
> notifiers and cannot safely tolerate the backing VMA changing from base
> pages to a transparent huge page after registration.

That's certainly not correct. This is a must-have for a whole lot of use cases.

Why exactly isn't that working for your use case?

Regards,
Christian.

> Userspace can
> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> collapse the range, after the GPU mapping has been registered.
> 
> On my system this showed up as asynchronous ROCm/HIP kernel launch
> failures, often reported later at a synchronization or copy point. I
> expect the issue to be relevant to AMDGPU/KFD mappings on
> XNACK-disabled GPUs more generally, because those mappings cannot rely
> on replayable GPU faults after a CPU-side THP remap. I have validated
> the failure and fix on AMD Radeon 780M / gfx1103.
> 
> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
> users can ask the MM core to keep the covered VMA range out of THP
> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
> over an active opt-in range is treated as an ignored hint, and
> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
> 
> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
> current behavior.
> 
> This does not disable THP globally and does not add work to GPU
> command submission or kernel launch paths. Additional work is limited
> to opt-in notifier registration, opt-in notifier flag transitions, and
> MADV_HUGEPAGE attempts that overlap an active opt-in range.
> 
> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
> 
>   - scripts/checkpatch.pl --strict --no-tree
>   - git apply --check
>   - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
>     DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
>   - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
>     originally exposed the failure on my Radeon 780M system
> 
> The standalone reproducers depend on ROCm userspace libraries, so I
> have not included them in this series. I can send them separately if
> useful.
> 
> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
> I reviewed the resulting code and take responsibility for the
> submission.
> 
> Yitao Jiang (3):
>   mm/mmu_notifier: let interval notifiers block THP
>   drm/amdgpu: block THP for HSA userptr notifiers
>   drm/amdkfd: block THP for non-replayable SVM ranges
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
>  include/linux/huge_mm.h                 |   5 +-
>  include/linux/mmu_notifier.h            |  28 ++++
>  mm/khugepaged.c                         |   9 +-
>  mm/madvise.c                            |   3 +-
>  mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
>  7 files changed, 286 insertions(+), 24 deletions(-)
> 
> 
> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
> --
> 2.53.0



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
  2026-06-25 12:35 ` Christian König
@ 2026-06-25 13:01   ` 蒋 亦韬
  2026-06-25 13:06     ` Christian König
  0 siblings, 1 reply; 14+ messages in thread
From: 蒋 亦韬 @ 2026-06-25 13:01 UTC (permalink / raw)
  To: Christian König, Alex Deucher, David Airlie, Simona Vetter,
	Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

Hi Christian,

I agree that my previous approach was wrong. Sorry about that. Please let me clarify the problem I was seeing and how I ended up with that incorrect conclusion.

The original problem was not a synthetic THP test. I was running ROCm/PyTorch ML training on an AMD Radeon 780M system, and the workload frequently failed with asynchronous HIP kernel launch failures. The userspace error usually surfaced later in PyTorch, for example around a copy/to_device/SetDevice path, but the kernel log showed GPU resets and KFD/MES queue eviction failures.

The relevant kernel messages I repeatedly saw were along these lines:

  MES failed to respond to msg=REMOVE_QUEUE
  MES failed to respond to msg=SUSPEND
  failed to suspend all gangs
  failed to remove hardware queue from MES
  Failed to evict queue
  Failed to evict process queues
  GPU reset begin

While trying to reduce the issue, I saw memory invalidations and THP-related page-table/backing-page activity driving the AMDGPU/KFD path through SVM eviction. On this system, the path I was looking at was roughly:

  svm_range_cpu_invalidate_pagetables()
    -> svm_range_evict()
    -> kgd2kfd_quiesce_mm()
    -> KFD process queue eviction
    -> MES REMOVE_QUEUE / SUSPEND

One thing that misled me was the XNACK-disabled path. Since the issue appeared on an XNACK-disabled APU, and that path requires queue eviction/quiesce when CPU page table invalidations affect GPU mappings, I incorrectly thought the backing-page change itself was something the driver had to prevent.

Another thing that misled me was that the application was not intentionally asking for THP behavior. From the workload’s point of view, these page transitions looked unrelated to the model computation. I therefore incorrectly assumed that userspace should not be able to change backing-page characteristics in a way that affects a driver mapping already registered with MMU interval notifiers. I now understand from the MM feedback that this is expected behavior, and that the notifier user must handle unmap/remap correctly.

So the more precise problem is that THP/remap is only one way to trigger the invalidation path. What is failing for my workload is the AMDGPU/KFD/MES queue quiesce/eviction path during those invalidations. When that fails, the GPU resets, and userspace later observes an asynchronous HIP failure.

Please allow me to continue investigating a more appropriate fix for this problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid changing MM-core or THP policy semantics.
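
As a starting point for the CPU side of a minimal reproducer, the following
userspace sketch provokes the kind of backing-page change discussed above. It
is only an illustration under assumptions: the helper names
thp_candidate_alloc() and thp_request_collapse() are hypothetical, the 2 MiB
huge page size is assumed (x86_64 PMD size), and the step that registers the
range with the GPU through ROCm is deliberately left out. It uses only
mmap(2) and madvise(2) with MADV_NOHUGEPAGE, MADV_HUGEPAGE and, where the
headers provide it, MADV_COLLAPSE.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: 2 MiB PMD-sized huge pages, as on x86_64. */
#define HPAGE_SIZE (2UL << 20)

/*
 * Map anonymous memory that contains one huge-page-aligned window and
 * fault it in with THP disabled, so it starts out backed by base pages.
 * Returns the aligned window, or NULL on failure. *raw and *raw_len
 * describe the whole mapping for a later munmap().
 */
void *thp_candidate_alloc(void **raw, size_t *raw_len)
{
	size_t len = 2 * HPAGE_SIZE;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uintptr_t aligned;

	if (p == MAP_FAILED)
		return NULL;
	*raw = p;
	*raw_len = len;

	/* Round up to the next huge page boundary inside the mapping. */
	aligned = ((uintptr_t)p + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);

	/* Best effort: fails harmlessly on kernels built without THP. */
	(void)madvise((void *)aligned, HPAGE_SIZE, MADV_NOHUGEPAGE);
	memset((void *)aligned, 0x5a, HPAGE_SIZE);
	return (void *)aligned;
}

/*
 * Allow THP for the window and, where available, ask for a synchronous
 * collapse. Returns 0 if the MADV_HUGEPAGE hint was accepted, -1 if not.
 * Whether a collapse actually happens is up to the kernel.
 */
int thp_request_collapse(void *aligned)
{
	if (madvise(aligned, HPAGE_SIZE, MADV_HUGEPAGE) != 0)
		return -1;
#ifdef MADV_COLLAPSE
	(void)madvise(aligned, HPAGE_SIZE, MADV_COLLAPSE);
#endif
	return 0;
}
```

A reproducer would register the window with the GPU between the two calls and
keep a kernel running across thp_request_collapse(). The contents of the
window are preserved either way; only the backing pages may change.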

Regards,
Yitao

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
  2026-06-25 13:01   ` Re: " 蒋 亦韬
@ 2026-06-25 13:06     ` Christian König
  2026-06-25 20:51       ` Kuehling, Felix
  0 siblings, 1 reply; 14+ messages in thread
From: Christian König @ 2026-06-25 13:06 UTC (permalink / raw)
  To: 蒋 亦韬, Alex Deucher, David Airlie,
	Simona Vetter, Felix Kuehling, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Yang, Philip
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

Hi Yitao,

adding Philip Yang.

Thanks for the investigation, that sounds like some kind of bug in the KFD SVM handling. The driver should be perfectly capable of handling this.

I strongly suggest opening a bug report against ROCm that describes how to reproduce this; Philip can probably point you to the right place for that.

Regards,
Christian.

On 6/25/26 15:01, 蒋 亦韬 wrote:
> Hi Christian,
> 
> I agree that my previous approach was wrong. Sorry about that. Please let me clarify the problem I was seeing and how I ended up with that incorrect conclusion.
> 
> The original problem was not a synthetic THP test. I was running ROCm/PyTorch ML training on an AMD Radeon 780M system, and the workload frequently failed with asynchronous HIP kernel launch failures. The userspace error usually surfaced later in PyTorch, for example around a copy/to_device/SetDevice path, but the kernel log showed GPU resets and KFD/MES queue eviction failures.
> 
> The relevant kernel messages I repeatedly saw were along these lines:
> 
>   MES failed to respond to msg=REMOVE_QUEUE
>   MES failed to respond to msg=SUSPEND
>   failed to suspend all gangs
>   failed to remove hardware queue from MES
>   Failed to evict queue
>   Failed to evict process queues
>   GPU reset begin
> 
> While trying to reduce the issue, I saw memory invalidations and THP-related page-table/backing-page activity driving the AMDGPU/KFD path through SVM eviction. On this system, the path I was looking at was roughly:
> 
>   svm_range_cpu_invalidate_pagetables()
>     -> svm_range_evict()
>     -> kgd2kfd_quiesce_mm()
>     -> KFD process queue eviction
>     -> MES REMOVE_QUEUE / SUSPEND
> 
> One thing that misled me was the XNACK-disabled path. Since the issue appeared on an XNACK-disabled APU, and that path requires queue eviction/quiesce when CPU page table invalidations affect GPU mappings, I incorrectly thought the backing-page change itself was something the driver had to prevent.
> 
> Another thing that misled me was that the application was not intentionally asking for THP behavior. From the workload’s point of view, these page transitions looked unrelated to the model computation. I therefore incorrectly assumed that userspace should not be able to change backing-page characteristics in a way that affects a driver mapping already registered with MMU interval notifiers. I now understand from the MM feedback that this is expected behavior, and that the notifier user must handle unmap/remap correctly.
> 
> So the more precise problem is that THP/remap is only one way to trigger the invalidation path. What is failing for my workload is the AMDGPU/KFD/MES queue quiesce/eviction path during those invalidations. When that fails, the GPU resets, and userspace later observes an asynchronous HIP failure.
> 
> Please allow me to continue investigating a more appropriate fix for this problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid changing MM-core or THP policy semantics.
> 
> Regards,
> Yitao



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
  2026-06-25 13:06     ` Christian König
@ 2026-06-25 20:51       ` Kuehling, Felix
  0 siblings, 0 replies; 14+ messages in thread
From: Kuehling, Felix @ 2026-06-25 20:51 UTC (permalink / raw)
  To: Christian König, 蒋 亦韬, Alex Deucher,
	David Airlie, Simona Vetter, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Yang, Philip
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

If there are MES queue eviction failures, then the root cause is most
likely an MES firmware problem or a bug in the driver's interaction
with MES. Your application dies in the GPU reset that follows. The MMU
notifier handling and the THP change are not the root cause; they are
only what happens to trigger the MES problem. The same thing could
happen with NUMA migrations, or with applications forking or being
terminated with Ctrl+C. In all of these scenarios the driver depends on
MES to preempt the user mode queues before the MMU notifier returns.
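
A toy model may make the dependency clearer. It is a sketch only: the toy_*
names and the scheduler_responds field are hypothetical stand-ins (the field
represents MES answering SUSPEND/REMOVE_QUEUE), and nothing here is driver
code. What it shows is that the outcome of an invalidation depends on whether
queue preemption succeeds, not on what caused the invalidation.

```c
#include <stdbool.h>

/* Hypothetical list of events that can invalidate a CPU mapping. */
enum toy_trigger {
	TOY_THP_COLLAPSE,
	TOY_NUMA_MIGRATION,
	TOY_FORK,
	TOY_EXIT,
};

struct toy_gpu {
	bool scheduler_responds;	/* stand-in for a responsive MES */
	bool queues_running;
	bool needs_reset;
};

/* Stand-in for queue preemption: succeeds only if the scheduler answers. */
bool toy_quiesce(struct toy_gpu *g)
{
	if (!g->scheduler_responds) {
		g->needs_reset = true;	/* preemption timed out */
		return false;
	}
	g->queues_running = false;
	return true;
}

/*
 * Stand-in for the invalidate callback. The trigger is deliberately not
 * consulted: every cause takes the same path through toy_quiesce().
 */
bool toy_invalidate_range(struct toy_gpu *g, enum toy_trigger why)
{
	(void)why;
	return toy_quiesce(g);
}
```

With scheduler_responds set, every trigger quiesces cleanly; with it cleared,
every trigger ends in needs_reset, which matches the point that blocking one
trigger would not remove the underlying failure.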

Regards,
   Felix


On 2026-06-25 09:06, Christian König wrote:
> Hi Yitao,
>
> adding Philip Yang.
>
> Thanks for the investigation, that sounds like some kind of bug in the KFD SVM handling. The driver should be perfectly capable of handling this.
>
> I strongly suggest opening a bug report against ROCm that describes how to reproduce this; Philip can probably point you to the right place for that.
>
> Regards,
> Christian.
>
> On 6/25/26 15:01, 蒋 亦韬 wrote:
>> Hi Christian,
>>
>> I agree that my previous approach was wrong. Sorry about that. Please let me clarify the problem I was seeing and how I ended up with that incorrect conclusion.
>>
>> The original problem was not a synthetic THP test. I was running ROCm/PyTorch ML training on an AMD Radeon 780M system, and the workload frequently failed with asynchronous HIP kernel launch failures. The userspace error usually surfaced later in PyTorch, for example around a copy/to_device/SetDevice path, but the kernel log showed GPU resets and KFD/MES queue eviction failures.
>>
>> The relevant kernel messages I repeatedly saw were along these lines:
>>
>>    MES failed to respond to msg=REMOVE_QUEUE
>>    MES failed to respond to msg=SUSPEND
>>    failed to suspend all gangs
>>    failed to remove hardware queue from MES
>>    Failed to evict queue
>>    Failed to evict process queues
>>    GPU reset begin
>>
>> While trying to reduce the issue, I saw memory invalidations and THP-related page-table/backing-page activity driving the AMDGPU/KFD path through SVM eviction. On this system, the path I was looking at was roughly:
>>
>>    svm_range_cpu_invalidate_pagetables()
>>      -> svm_range_evict()
>>      -> kgd2kfd_quiesce_mm()
>>      -> KFD process queue eviction
>>      -> MES REMOVE_QUEUE / SUSPEND
>>
>> One thing that misled me was the XNACK-disabled path. Since the issue appeared on an XNACK-disabled APU, and that path requires queue eviction/quiesce when CPU page table invalidations affect GPU mappings, I incorrectly thought the backing-page change itself was something the driver had to prevent.
>>
>> Another thing that misled me was that the application was not intentionally asking for THP behavior. From the workload’s point of view, these page transitions looked unrelated to the model computation. I therefore incorrectly assumed that userspace should not be able to change backing-page characteristics in a way that affects a driver mapping already registered with MMU interval notifiers. I now understand from the MM feedback that this is expected behavior, and that the notifier user must handle unmap/remap correctly.
>>
>> So the more precise problem is that THP/remap is only one way to trigger the invalidation path. What is failing for my workload is the AMDGPU/KFD/MES queue quiesce/eviction path during those invalidations. When that fails, the GPU resets, and userspace later observes an asynchronous HIP failure.
>>
>> Please allow me to continue investigating a more appropriate fix for this problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid changing MM-core or THP policy semantics.
>>
>> Regards,
>> Yitao
>> ------------------------------------------------------------
>> *From:* Christian König <christian.koenig@amd.com>
>> *Sent:* June 25, 2026 8:35
>> *To:* Yitao Jiang <jytscientist@hotmail.com>; Alex Deucher <alexander.deucher@amd.com>; David Airlie <airlied@gmail.com>; Simona Vetter <simona@ffwll.ch>; Felix Kuehling <Felix.Kuehling@amd.com>; Andrew Morton <akpm@linux-foundation.org>; David Hildenbrand <david@kernel.org>; Lorenzo Stoakes <ljs@kernel.org>
>> *Cc:* Zi Yan <ziy@nvidia.com>; Baolin Wang <baolin.wang@linux.alibaba.com>; Liam R . Howlett <liam@infradead.org>; Nico Pache <npache@redhat.com>; Ryan Roberts <ryan.roberts@arm.com>; Dev Jain <dev.jain@arm.com>; Barry Song <baohua@kernel.org>; Lance Yang <lance.yang@linux.dev>; Vlastimil Babka <vbabka@kernel.org>; Mike Rapoport <rppt@kernel.org>; Suren Baghdasaryan <surenb@google.com>; Michal Hocko <mhocko@suse.com>; Jann Horn <jannh@google.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; linux-mm@kvack.org <linux-mm@kvack.org>
>> *Subject:* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
>>   
>> On 6/25/26 12:59, Yitao Jiang wrote:
>>> Hi,
>>>
>>> This series fixes a THP policy problem I found while debugging
>>> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
>>> training.
>>>
>>> Some AMDGPU/KFD user mappings are registered through interval
>>> notifiers and cannot safely tolerate the backing VMA changing from base
>>> pages to a transparent huge page after registration.
>> That's certainly not correct. This is a must have for a whole lot of use cases.
>>
>> Why exactly isn't that working for your use case?
>>
>> Regards,
>> Christian.
>>
>>> Userspace can
>>> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
>>> collapse the range, after the GPU mapping has been registered.
>>>
>>> On my system this showed up as asynchronous ROCm/HIP kernel launch
>>> failures, often reported later at a synchronization or copy point. I
>>> expect the issue to be relevant to AMDGPU/KFD mappings on
>>> XNACK-disabled GPUs more generally, because those mappings cannot rely
>>> on replayable GPU faults after a CPU-side THP remap. I have validated
>>> the failure and fix on AMD Radeon 780M / gfx1103.
>>>
>>> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
>>> users can ask the MM core to keep the covered VMA range out of THP
>>> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
>>> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
>>> over an active opt-in range is treated as an ignored hint, and
>>> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
>>>
>>> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
>>> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
>>> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
>>> current behavior.
>>>
>>> This does not disable THP globally and does not add work to GPU
>>> command submission or kernel launch paths. Additional work is limited
>>> to opt-in notifier registration, opt-in notifier flag transitions, and
>>> MADV_HUGEPAGE attempts that overlap an active opt-in range.
>>>
>>> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
>>>
>>>     - scripts/checkpatch.pl --strict --no-tree
>>>     - git apply --check
>>>     - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
>>>       DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
>>>     - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
>>>       originally exposed the failure on my Radeon 780M system
>>>
>>> The standalone reproducers depend on ROCm userspace libraries, so I
>>> have not included them in this series. I can send them separately if
>>> useful.
>>>
>>> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
>>> I reviewed the resulting code and take responsibility for the
>>> submission.
>>>
>>> Yitao Jiang (3):
>>>     mm/mmu_notifier: let interval notifiers block THP
>>>     drm/amdgpu: block THP for HSA userptr notifiers
>>>     drm/amdkfd: block THP for non-replayable SVM ranges
>>>
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
>>>    drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
>>>    include/linux/huge_mm.h                 |   5 +-
>>>    include/linux/mmu_notifier.h            |  28 ++++
>>>    mm/khugepaged.c                         |   9 +-
>>>    mm/madvise.c                            |   3 +-
>>>    mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
>>>    7 files changed, 286 insertions(+), 24 deletions(-)
>>>
>>>
>>> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
>>> --
>>> 2.53.0


end of thread, other threads:[~2026-06-25 20:51 UTC | newest]

Thread overview: 14+ messages
2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
2026-06-25 11:50   ` David Hildenbrand (Arm)
2026-06-25 11:58   ` Lorenzo Stoakes
2026-06-25 10:59 ` [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers Yitao Jiang
2026-06-25 12:36   ` Christian König
2026-06-25 10:59 ` [PATCH 3/3] drm/amdkfd: block THP for non-replayable SVM ranges Yitao Jiang
2026-06-25 11:47 ` [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings David Hildenbrand (Arm)
2026-06-25 11:54   ` Lorenzo Stoakes
2026-06-25 12:14     ` Re: " 蒋 亦韬
2026-06-25 12:35 ` Christian König
2026-06-25 13:01   ` Re: " 蒋 亦韬
2026-06-25 13:06     ` Christian König
2026-06-25 20:51       ` Kuehling, Felix
