* [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
@ 2026-06-25 10:59 Yitao Jiang
2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Yitao Jiang @ 2026-06-25 10:59 UTC
To: Alex Deucher, Christian König, David Airlie, Simona Vetter,
Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
linux-kernel, linux-mm, Yitao Jiang
Hi,
This series fixes a THP policy problem I found while debugging
frequent ROCm GPU failures on an AMD Radeon 780M system during ML
training.
Some AMDGPU/KFD user mappings are registered through interval
notifiers and cannot safely tolerate the backing VMA changing from base
pages to a transparent huge page after registration. Userspace can
still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
collapse the range, after the GPU mapping has been registered.
On my system this showed up as asynchronous ROCm/HIP kernel launch
failures, often reported later at a synchronization or copy point. I
expect the issue to be relevant to AMDGPU/KFD mappings on
XNACK-disabled GPUs more generally, because those mappings cannot rely
on replayable GPU faults after a CPU-side THP remap. I have validated
the failure and fix on AMD Radeon 780M / gfx1103.
Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
users can ask the MM core to keep the covered VMA range out of THP
while the notifier is active. The MM core applies VM_NOHUGEPAGE and
clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
over an active opt-in range is treated as an ignored hint, and
MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
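
As a rough illustration of the flag transitions described for patch 1, here is a small userspace model. It is a sketch only, not kernel code: the MODEL_* bit values and the model_* function names are hypothetical stand-ins for VM_HUGEPAGE, VM_NOHUGEPAGE, and the kernel paths involved, and "range_blocked" stands for the result of the new interval-tree lookup.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's VM_HUGEPAGE / VM_NOHUGEPAGE bits. */
#define MODEL_VM_HUGEPAGE   (1u << 0)
#define MODEL_VM_NOHUGEPAGE (1u << 1)

/* Opt-in registration: set NOHUGEPAGE and clear HUGEPAGE on the VMA. */
unsigned int model_block_thp(unsigned int vm_flags)
{
	vm_flags |= MODEL_VM_NOHUGEPAGE;
	vm_flags &= ~MODEL_VM_HUGEPAGE;
	return vm_flags;
}

/*
 * MADV_HUGEPAGE: an ignored hint (flags unchanged) when the VMA is already
 * NOHUGEPAGE and an active opt-in notifier overlaps the range; otherwise
 * the usual switch to HUGEPAGE.
 */
unsigned int model_madv_hugepage(unsigned int vm_flags, bool range_blocked)
{
	if ((vm_flags & MODEL_VM_NOHUGEPAGE) && range_blocked)
		return vm_flags;
	vm_flags &= ~MODEL_VM_NOHUGEPAGE;
	vm_flags |= MODEL_VM_HUGEPAGE;
	return vm_flags;
}

/* MADV_COLLAPSE is refused for NOHUGEPAGE VMAs. */
bool model_collapse_allowed(unsigned int vm_flags)
{
	return !(vm_flags & MODEL_VM_NOHUGEPAGE);
}

/* Walk through the sequence: register, then try both madvise hints. */
void model_demo(void)
{
	unsigned int flags = model_block_thp(MODEL_VM_HUGEPAGE);

	assert(flags == MODEL_VM_NOHUGEPAGE);
	assert(model_madv_hugepage(flags, true) == flags);
	assert(!model_collapse_allowed(flags));
	/* Without an active opt-in notifier the hint behaves as before. */
	assert(model_madv_hugepage(flags, false) == MODEL_VM_HUGEPAGE);
}
```

Calling model_demo() from a test program exercises the whole sequence. The model deliberately ignores VMA splitting and locking, which the real patch handles under mmap_lock.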
Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
current behavior.
This does not disable THP globally and does not add work to GPU
command submission or kernel launch paths. Additional work is limited
to opt-in notifier registration, opt-in notifier flag transitions, and
MADV_HUGEPAGE attempts that overlap an active opt-in range.
I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
- scripts/checkpatch.pl --strict --no-tree
- git apply --check
- x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
  DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
- standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
  originally exposed the failure on my Radeon 780M system
The standalone reproducers depend on ROCm userspace libraries, so I
have not included them in this series. I can send them separately if
useful.
This series was prepared with assistance from OpenAI Codex (GPT-5.5).
I reviewed the resulting code and take responsibility for the
submission.
Yitao Jiang (3):
mm/mmu_notifier: let interval notifiers block THP
drm/amdgpu: block THP for HSA userptr notifiers
drm/amdkfd: block THP for non-replayable SVM ranges
drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 25 ++-
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 36 ++++-
include/linux/huge_mm.h | 5 +-
include/linux/mmu_notifier.h | 28 ++++
mm/khugepaged.c | 9 +-
mm/madvise.c | 3 +-
mm/mmu_notifier.c | 204 +++++++++++++++++++++++-
7 files changed, 286 insertions(+), 24 deletions(-)
base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
--
2.53.0
* [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP
2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
@ 2026-06-25 10:59 ` Yitao Jiang
2026-06-25 11:18 ` sashiko-bot
` (2 more replies)
2026-06-25 10:59 ` [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers Yitao Jiang
` (3 subsequent siblings)
4 siblings, 3 replies; 16+ messages in thread
From: Yitao Jiang @ 2026-06-25 10:59 UTC
To: Alex Deucher, Christian König, David Airlie, Simona Vetter,
Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
linux-kernel, linux-mm, Yitao Jiang
Some secondary MMUs cannot safely tolerate a user VMA becoming backed
by transparent huge pages after the range has been registered with an
interval notifier. Drivers can observe the page-table layout change
through invalidations, but devices without replayable faults, or ranges
that must stay mapped, cannot necessarily re-establish coherent device
mappings before later device access.
Add MMU_INTERVAL_NOTIFIER_BLOCK_THP so a driver can declare this
property when registering an interval notifier. The MM core then marks
the covered VMA range VM_NOHUGEPAGE and clears VM_HUGEPAGE while
holding mmap_lock for write. A later MADV_HUGEPAGE on the same active
range is treated as an ignored hint, leaving the MM-owned nohuge
policy intact. MADV_COLLAPSE already rejects VM_NOHUGEPAGE VMAs.
This keeps the policy in MM code instead of requiring device drivers
to edit VMA THP flags directly, and it only affects opt-in notifier
ranges at registration or flag-transition time.
Assisted-by: OpenAI-Codex:GPT-5.5
Signed-off-by: Yitao Jiang <jytscientist@hotmail.com>
---
include/linux/huge_mm.h | 5 +-
include/linux/mmu_notifier.h | 28 +++++
mm/khugepaged.c | 9 +-
mm/madvise.c | 3 +-
mm/mmu_notifier.c | 204 +++++++++++++++++++++++++++++++++--
5 files changed, 237 insertions(+), 12 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad20f7f8c..3dae515ff 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -489,8 +489,8 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
__split_huge_pud(__vma, __pud, __address); \
} while (0)
-int hugepage_madvise(struct vm_area_struct *vma, vm_flags_t *vm_flags,
- int advice);
+int hugepage_madvise(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, vm_flags_t *vm_flags, int advice);
int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
unsigned long end, bool *lock_dropped);
void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
@@ -694,6 +694,7 @@ static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
do { } while (0)
static inline int hugepage_madvise(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
vm_flags_t *vm_flags, int advice)
{
return -EINVAL;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index a11a44eef..4accfb65f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -293,8 +293,16 @@ struct mmu_interval_notifier {
struct mm_struct *mm;
struct hlist_node deferred_item;
unsigned long invalidate_seq;
+ unsigned int flags;
};
+/*
+ * The interval range cannot safely be backed by transparent huge pages while
+ * the notifier is active. The MM core owns the VMA policy change so drivers
+ * do not have to manipulate VM_HUGEPAGE/VM_NOHUGEPAGE directly.
+ */
+#define MMU_INTERVAL_NOTIFIER_BLOCK_THP BIT(0)
+
#ifdef CONFIG_MMU_NOTIFIER
#ifdef CONFIG_LOCKDEP
@@ -347,7 +355,20 @@ int mmu_interval_notifier_insert_locked(
struct mmu_interval_notifier *interval_sub, struct mm_struct *mm,
unsigned long start, unsigned long length,
const struct mmu_interval_notifier_ops *ops);
+int
+mmu_interval_notifier_insert_locked_flags(struct mmu_interval_notifier *interval_sub,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long length,
+ const struct mmu_interval_notifier_ops *ops,
+ unsigned int flags);
+int
+mmu_interval_notifier_set_flags_locked(struct mmu_interval_notifier *interval_sub,
+ unsigned int flags);
void mmu_interval_notifier_remove(struct mmu_interval_notifier *interval_sub);
+bool mmu_interval_notifier_range_block_thp(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end);
/**
* mmu_interval_set_seq - Save the invalidation sequence
@@ -637,6 +658,13 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
{
}
+static inline bool mmu_interval_notifier_range_block_thp(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ return false;
+}
+
#define mmu_notifier_range_update_to_read_only(r) false
static inline void mmu_notifier_synchronize(void)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 617bca76d..a9b05e716 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -445,11 +445,16 @@ static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
return khugepaged_max_ptes_swap;
}
-int hugepage_madvise(struct vm_area_struct *vma,
- vm_flags_t *vm_flags, int advice)
+int hugepage_madvise(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, vm_flags_t *vm_flags, int advice)
{
switch (advice) {
case MADV_HUGEPAGE:
+ if ((*vm_flags & VM_NOHUGEPAGE) &&
+ mmu_interval_notifier_range_block_thp(vma->vm_mm,
+ start, end))
+ return 0;
+
*vm_flags &= ~VM_NOHUGEPAGE;
*vm_flags |= VM_HUGEPAGE;
/*
diff --git a/mm/madvise.c b/mm/madvise.c
index cd9bb0770..c7cee4fcf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1416,7 +1416,8 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
break;
case MADV_HUGEPAGE:
case MADV_NOHUGEPAGE:
- error = hugepage_madvise(vma, &new_flags, behavior);
+ error = hugepage_madvise(vma, range->start, range->end,
+ &new_flags, behavior);
if (error)
goto out;
break;
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 245b74f39..852a5682b 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -581,6 +581,49 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
return 0;
}
+/**
+ * mmu_interval_notifier_range_block_thp - check if a range must not use THP
+ * @mm: mm_struct to check
+ * @start: start address
+ * @end: end address
+ *
+ * Return true if an active interval notifier covering the range requested
+ * MMU_INTERVAL_NOTIFIER_BLOCK_THP.
+ */
+bool mmu_interval_notifier_range_block_thp(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct mmu_notifier_subscriptions *subscriptions;
+ struct mmu_interval_notifier *interval_sub;
+ struct interval_tree_node *node;
+ bool block_thp = false;
+
+ if (start >= end)
+ return false;
+
+ /* Pairs with the store in mmu_notifier_register(). */
+ subscriptions = smp_load_acquire(&mm->notifier_subscriptions);
+ if (!subscriptions || !subscriptions->has_itree)
+ return false;
+
+ spin_lock(&subscriptions->lock);
+ for (node = interval_tree_iter_first(&subscriptions->itree, start,
+ end - 1);
+ node;
+ node = interval_tree_iter_next(node, start, end - 1)) {
+ interval_sub = container_of(node, struct mmu_interval_notifier,
+ interval_tree);
+ if (interval_sub->flags & MMU_INTERVAL_NOTIFIER_BLOCK_THP) {
+ block_thp = true;
+ break;
+ }
+ }
+ spin_unlock(&subscriptions->lock);
+
+ return block_thp;
+}
+
static void
mn_hlist_invalidate_end(struct mmu_notifier_subscriptions *subscriptions,
struct mmu_notifier_range *range)
@@ -933,13 +976,69 @@ void mmu_notifier_put(struct mmu_notifier *subscription)
}
EXPORT_SYMBOL_GPL(mmu_notifier_put);
+#define MMU_INTERVAL_NOTIFIER_KNOWN_FLAGS \
+ (MMU_INTERVAL_NOTIFIER_BLOCK_THP)
+
+static int mmu_interval_notifier_check_flags(unsigned int flags)
+{
+ if (flags & ~MMU_INTERVAL_NOTIFIER_KNOWN_FLAGS)
+ return -EINVAL;
+ return 0;
+}
+
+static int
+mmu_interval_notifier_block_thp_locked(struct mm_struct *mm,
+ unsigned long start,
+ unsigned long end)
+{
+ struct vm_area_struct *vma, *prev;
+ struct vma_iterator vmi;
+
+ mmap_assert_write_locked(mm);
+
+ vma_iter_init(&vmi, mm, start);
+ vma = vma_iter_load(&vmi);
+ prev = vma_prev(&vmi);
+ if (vma && start > vma->vm_start)
+ prev = vma;
+
+ for_each_vma_range(vmi, vma, end) {
+ const unsigned long curr_start = max(vma->vm_start, start);
+ const unsigned long curr_end = min(vma->vm_end, end);
+ vma_flags_t new_flags;
+
+ if (vma->vm_flags & VM_NO_KHUGEPAGED)
+ goto next;
+
+ new_flags = vma->flags;
+ vma_flags_set(&new_flags, VMA_NOHUGEPAGE_BIT);
+ vma_flags_clear(&new_flags, VMA_HUGEPAGE_BIT);
+ if (vma_flags_same_pair(&new_flags, &vma->flags))
+ goto next;
+
+ vma = vma_modify_flags(&vmi, prev, vma, curr_start,
+ curr_end, &new_flags);
+ if (IS_ERR(vma))
+ return PTR_ERR(vma);
+
+ vma_start_write(vma);
+ vma->flags = new_flags;
+next:
+ prev = vma;
+ }
+
+ return 0;
+}
+
static int __mmu_interval_notifier_insert(
struct mmu_interval_notifier *interval_sub, struct mm_struct *mm,
struct mmu_notifier_subscriptions *subscriptions, unsigned long start,
- unsigned long length, const struct mmu_interval_notifier_ops *ops)
+ unsigned long length, const struct mmu_interval_notifier_ops *ops,
+ unsigned int flags)
{
interval_sub->mm = mm;
interval_sub->ops = ops;
+ interval_sub->flags = flags;
RB_CLEAR_NODE(&interval_sub->interval_tree.rb);
interval_sub->interval_tree.start = start;
/*
@@ -1034,32 +1133,123 @@ int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
subscriptions = mm->notifier_subscriptions;
}
return __mmu_interval_notifier_insert(interval_sub, mm, subscriptions,
- start, length, ops);
+ start, length, ops, 0);
}
EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert);
-int mmu_interval_notifier_insert_locked(
- struct mmu_interval_notifier *interval_sub, struct mm_struct *mm,
- unsigned long start, unsigned long length,
- const struct mmu_interval_notifier_ops *ops)
+/**
+ * mmu_interval_notifier_insert_locked_flags - Insert an interval notifier
+ * @interval_sub: Interval subscription to register
+ * @mm: mm_struct to attach to
+ * @start: Starting virtual address to monitor
+ * @length: Length of the range to monitor
+ * @ops: Interval notifier operations to be called on matching events
+ * @flags: MMU_INTERVAL_NOTIFIER_* flags
+ *
+ * Like mmu_interval_notifier_insert_locked(), but lets callers request
+ * additional MM-owned policy for the interval while holding mmap_lock for
+ * write.
+ */
+int
+mmu_interval_notifier_insert_locked_flags(struct mmu_interval_notifier *interval_sub,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long length,
+ const struct mmu_interval_notifier_ops *ops,
+ unsigned int flags)
{
struct mmu_notifier_subscriptions *subscriptions =
mm->notifier_subscriptions;
+ unsigned long end;
int ret;
mmap_assert_write_locked(mm);
+ ret = mmu_interval_notifier_check_flags(flags);
+ if (ret)
+ return ret;
+
+ if (flags & MMU_INTERVAL_NOTIFIER_BLOCK_THP) {
+ if (length == 0 || check_add_overflow(start, length, &end))
+ return -EOVERFLOW;
+ }
+
if (!subscriptions || !subscriptions->has_itree) {
ret = __mmu_notifier_register(NULL, mm);
if (ret)
return ret;
subscriptions = mm->notifier_subscriptions;
}
+
+ if (flags & MMU_INTERVAL_NOTIFIER_BLOCK_THP) {
+ ret = mmu_interval_notifier_block_thp_locked(mm, start, end);
+ if (ret)
+ return ret;
+ }
+
return __mmu_interval_notifier_insert(interval_sub, mm, subscriptions,
- start, length, ops);
+ start, length, ops, flags);
+}
+EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked_flags);
+
+int mmu_interval_notifier_insert_locked(struct mmu_interval_notifier *interval_sub,
+ struct mm_struct *mm,
+ unsigned long start,
+ unsigned long length,
+ const struct mmu_interval_notifier_ops *ops)
+{
+ return mmu_interval_notifier_insert_locked_flags(interval_sub, mm,
+ start, length,
+ ops, 0);
}
EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked);
+/**
+ * mmu_interval_notifier_set_flags_locked - update an interval notifier's flags
+ * @interval_sub: Interval subscription to update
+ * @flags: MMU_INTERVAL_NOTIFIER_* flags
+ *
+ * Update MMU interval notifier flags while holding mmap_lock for write. When
+ * enabling MMU_INTERVAL_NOTIFIER_BLOCK_THP, the MM core first updates the VMA
+ * THP policy for the notifier's address range.
+ */
+int
+mmu_interval_notifier_set_flags_locked(struct mmu_interval_notifier *interval_sub,
+ unsigned int flags)
+{
+ struct mm_struct *mm = interval_sub->mm;
+ unsigned long start = interval_sub->interval_tree.start;
+ unsigned long end;
+ int ret;
+
+ ret = mmu_interval_notifier_check_flags(flags);
+ if (ret)
+ return ret;
+
+ if (WARN_ON_ONCE(!mm))
+ return -EINVAL;
+
+ mmap_assert_write_locked(mm);
+
+ if ((flags & MMU_INTERVAL_NOTIFIER_BLOCK_THP) &&
+ !(interval_sub->flags & MMU_INTERVAL_NOTIFIER_BLOCK_THP)) {
+ if (interval_sub->interval_tree.last == ULONG_MAX)
+ return -EOVERFLOW;
+ end = interval_sub->interval_tree.last + 1;
+
+ ret = mmu_interval_notifier_block_thp_locked(mm, start, end);
+ if (ret)
+ return ret;
+ }
+
+ spin_lock(&mm->notifier_subscriptions->lock);
+ interval_sub->flags = flags;
+ spin_unlock(&mm->notifier_subscriptions->lock);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mmu_interval_notifier_set_flags_locked);
+
static bool
mmu_interval_seq_released(struct mmu_notifier_subscriptions *subscriptions,
unsigned long seq)
--
2.53.0
* [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers
2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
@ 2026-06-25 10:59 ` Yitao Jiang
2026-06-25 11:26 ` sashiko-bot
2026-06-25 12:36 ` Christian König
2026-06-25 10:59 ` [PATCH 3/3] drm/amdkfd: block THP for non-replayable SVM ranges Yitao Jiang
` (2 subsequent siblings)
4 siblings, 2 replies; 16+ messages in thread
From: Yitao Jiang @ 2026-06-25 10:59 UTC
To: Alex Deucher, Christian König, David Airlie, Simona Vetter,
Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
linux-kernel, linux-mm, Yitao Jiang
HSA userptr buffer objects are used by KFD compute queues. On systems
where the GPU cannot reliably tolerate a CPU THP remap of an active
userptr range, allowing khugepaged or MADV_COLLAPSE to replace PTE
mappings with a PMD mapping can leave later GPU work failing
asynchronously.
Register HSA userptr interval notifiers with
MMU_INTERVAL_NOTIFIER_BLOCK_THP. GFX userptrs keep the existing
notifier path and do not opt in.
Assisted-by: OpenAI-Codex:GPT-5.5
Signed-off-by: Yitao Jiang <jytscientist@hotmail.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
index 99bc9ad67..c0b36164c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
@@ -44,6 +44,7 @@
*/
#include <linux/firmware.h>
+#include <linux/mm.h>
#include <linux/module.h>
#include <drm/drm.h>
@@ -130,16 +131,24 @@ static const struct mmu_interval_notifier_ops amdgpu_hmm_hsa_ops = {
*/
int amdgpu_hmm_register(struct amdgpu_bo *bo, unsigned long addr)
{
+ struct mm_struct *mm = current->mm;
+ unsigned long size = amdgpu_bo_size(bo);
int r;
- if (bo->kfd_bo)
- r = mmu_interval_notifier_insert(&bo->notifier, current->mm,
- addr, amdgpu_bo_size(bo),
- &amdgpu_hmm_hsa_ops);
- else
- r = mmu_interval_notifier_insert(&bo->notifier, current->mm, addr,
- amdgpu_bo_size(bo),
- &amdgpu_hmm_gfx_ops);
+ if (unlikely(!mm))
+ return -ESRCH;
+
+ if (bo->kfd_bo) {
+ mmap_write_lock(mm);
+ r = mmu_interval_notifier_insert_locked_flags(&bo->notifier, mm,
+ addr, size,
+ &amdgpu_hmm_hsa_ops,
+ MMU_INTERVAL_NOTIFIER_BLOCK_THP);
+ mmap_write_unlock(mm);
+ } else {
+ r = mmu_interval_notifier_insert(&bo->notifier, mm, addr, size,
+ &amdgpu_hmm_gfx_ops);
+ }
if (r)
/*
* Make sure amdgpu_hmm_unregister() doesn't call
--
2.53.0
* [PATCH 3/3] drm/amdkfd: block THP for non-replayable SVM ranges
2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
2026-06-25 10:59 ` [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers Yitao Jiang
@ 2026-06-25 10:59 ` Yitao Jiang
2026-06-25 11:11 ` sashiko-bot
2026-06-25 11:47 ` [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings David Hildenbrand (Arm)
2026-06-25 12:35 ` Christian König
4 siblings, 1 reply; 16+ messages in thread
From: Yitao Jiang @ 2026-06-25 10:59 UTC
To: Alex Deucher, Christian König, David Airlie, Simona Vetter,
Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
linux-kernel, linux-mm, Yitao Jiang
KFD SVM ranges on processes without XNACK, and ranges requested as
GPU_ALWAYS_MAPPED, cannot rely on replayable GPU faults after a CPU THP
remap of the registered VA range. Keep those ranges backed by base
pages while their interval notifier is active.
Opt those SVM interval notifiers into MMU_INTERVAL_NOTIFIER_BLOCK_THP
and update the flag when SVM attributes change. XNACK-enabled ranges
that can handle remaps through replayable faults remain eligible for
THP unless GPU_ALWAYS_MAPPED is requested.
Assisted-by: OpenAI-Codex:GPT-5.5
Signed-off-by: Yitao Jiang <jytscientist@hotmail.com>
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 36 ++++++++++++++++++++++++----
1 file changed, 32 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 3841943da..0d0feba7b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -22,6 +22,7 @@
*/
#include <linux/types.h>
+#include <linux/mm.h>
#include <linux/sched/task.h>
#include <linux/dynamic_debug.h>
#include <drm/ttm/ttm_tt.h>
@@ -81,6 +82,26 @@ static const struct mmu_interval_notifier_ops svm_range_mn_ops = {
.invalidate = svm_range_cpu_invalidate_pagetables,
};
+static unsigned int
+svm_range_mn_flags(struct svm_range *prange)
+{
+ struct kfd_process *p = container_of(prange->svms, struct kfd_process,
+ svms);
+
+ if (!p->xnack_enabled ||
+ (prange->flags & KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED))
+ return MMU_INTERVAL_NOTIFIER_BLOCK_THP;
+
+ return 0;
+}
+
+static int
+svm_range_update_mn_flags_locked(struct svm_range *prange)
+{
+ return mmu_interval_notifier_set_flags_locked(&prange->notifier,
+ svm_range_mn_flags(prange));
+}
+
/**
* svm_range_unlink - unlink svm_range from lists and interval tree
* @prange: svm range structure to be removed
@@ -112,10 +133,11 @@ svm_range_add_notifier_locked(struct mm_struct *mm, struct svm_range *prange)
pr_debug("svms 0x%p prange 0x%p [0x%lx 0x%lx]\n", prange->svms,
prange, prange->start, prange->last);
- mmu_interval_notifier_insert_locked(&prange->notifier, mm,
- prange->start << PAGE_SHIFT,
- prange->npages << PAGE_SHIFT,
- &svm_range_mn_ops);
+ mmu_interval_notifier_insert_locked_flags(&prange->notifier, mm,
+ prange->start << PAGE_SHIFT,
+ prange->npages << PAGE_SHIFT,
+ &svm_range_mn_ops,
+ svm_range_mn_flags(prange));
}
/**
@@ -3763,6 +3785,12 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
}
list_for_each_entry(prange, &update_list, update_list) {
svm_range_apply_attrs(p, prange, nattr, attrs, &update_mapping);
+ r = svm_range_update_mn_flags_locked(prange);
+ if (r) {
+ mutex_unlock(&svms->lock);
+ mmap_write_unlock(mm);
+ goto out;
+ }
/* TODO: unmap ranges from GPU that lost access */
}
update_mapping |= !p->xnack_enabled && !list_empty(&remap_list);
--
2.53.0
* Re: [PATCH 3/3] drm/amdkfd: block THP for non-replayable SVM ranges
2026-06-25 10:59 ` [PATCH 3/3] drm/amdkfd: block THP for non-replayable SVM ranges Yitao Jiang
@ 2026-06-25 11:11 ` sashiko-bot
0 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-06-25 11:11 UTC
To: Yitao Jiang; +Cc: dri-devel
Thank you for your contribution! Sashiko AI review found 5 potential issue(s) to consider:
New issues:
- [High] Error handling in `svm_range_set_attr` aborts the transaction mid-flight without rollback, leaving overlapping ranges in the SVM interval tree and leaking memory.
- [High] `svm_range_add_notifier_locked` silently ignores failures when registering the MMU interval notifier, leading to WARN_ON_ONCE and interval tree corruption.
- [High] Clearing the `KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED` attribute permanently leaks a THP block node and permanently disables THP for the virtual address range.
- [High] Dynamically disabling XNACK via ioctl does not update SVM range notifier flags, leaving THP unblocked and risking GPU page faults.
Pre-existing issues:
- [High] If a user maps SVM memory at virtual address 0, its MMU interval notifier is never removed, causing a Use-After-Free.
--
commit f4df270ac94ba46190c602a81dedff1aa60cc86a
Author: Yitao Jiang <jytscientist@hotmail.com>
drm/amdkfd: block THP for non-replayable SVM ranges
KFD SVM ranges on processes without XNACK, and ranges requested as
GPU_ALWAYS_MAPPED, cannot rely on replayable GPU faults after a CPU THP
remap of the registered VA range. Keep those ranges backed by base
pages while their interval notifier is active.
Opt those SVM interval notifiers into MMU_INTERVAL_NOTIFIER_BLOCK_THP
and update the flag when SVM attributes change. XNACK-enabled ranges
that can handle remaps through replayable faults remain eligible for
THP unless GPU_ALWAYS_MAPPED is requested.
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 3841943da5ece..0d0feba7b7852 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
[ ... ]
> @@ -81,6 +82,26 @@ static const struct mmu_interval_notifier_ops svm_range_mn_ops = {
> .invalidate = svm_range_cpu_invalidate_pagetables,
> };
>
> +static unsigned int
> +svm_range_mn_flags(struct svm_range *prange)
> +{
> + struct kfd_process *p = container_of(prange->svms, struct kfd_process,
> + svms);
> +
> + if (!p->xnack_enabled ||
> + (prange->flags & KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED))
> + return MMU_INTERVAL_NOTIFIER_BLOCK_THP;
> +
> + return 0;
> +}
[Severity: High]
When a process dynamically disables XNACK using the AMDKFD_IOC_SET_XNACK_MODE ioctl,
does this leave THP unblocked on existing ranges?
In svm_range_switch_xnack_reserve_mem(), p->xnack_enabled is updated
dynamically:
p->xnack_enabled = xnack_enabled;
However, it does not seem to update the interval notifier flags for existing
SVM ranges. Because the existing ranges remain with THP unblocked, could
khugepaged collapse THPs and generate unrecoverable GPU page faults since
XNACK is disabled?
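
The concern can be pictured with a small userspace model. This is a sketch under stated assumptions, not the driver code: the struct and MODEL_* names are hypothetical, and whether the real XNACK switch path leaves notifier flags untouched depends on code outside this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver and MM flag bits. */
#define MODEL_GPU_ALWAYS_MAPPED (1u << 0)
#define MODEL_BLOCK_THP         (1u << 0)

struct model_process {
	bool xnack_enabled;
};

struct model_range {
	unsigned int svm_flags;      /* per-range SVM attributes */
	unsigned int notifier_flags; /* value cached at registration */
};

/* Same decision as svm_range_mn_flags() in the patch. */
unsigned int model_mn_flags(const struct model_process *p,
			    const struct model_range *r)
{
	if (!p->xnack_enabled || (r->svm_flags & MODEL_GPU_ALWAYS_MAPPED))
		return MODEL_BLOCK_THP;
	return 0;
}

void model_demo(void)
{
	struct model_process p = { .xnack_enabled = true };
	struct model_range r = { 0, 0 };

	/* Registration while XNACK is on: no THP blocking requested. */
	r.notifier_flags = model_mn_flags(&p, &r);
	assert(r.notifier_flags == 0);

	/* XNACK is switched off, but nothing recomputes the cached value. */
	p.xnack_enabled = false;
	assert(model_mn_flags(&p, &r) == MODEL_BLOCK_THP);
	assert(r.notifier_flags == 0);
}
```

In this model the cached value only matches the current policy if something recomputes it after the process-wide setting changes.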
> +
> +static int
> +svm_range_update_mn_flags_locked(struct svm_range *prange)
> +{
> + return mmu_interval_notifier_set_flags_locked(&prange->notifier,
> + svm_range_mn_flags(prange));
> +}
[Severity: High]
If userspace clears the KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED flag or XNACK
is enabled, does this permanently leak a THP block node?
When svm_range_update_mn_flags_locked() calls
mmu_interval_notifier_set_flags_locked() with flags = 0, the
interval_sub->flags are zeroed out locally. However, since there is no
teardown path to call mmu_interval_notifier_unblock_thp_locked(), the THP
block node remains.
When the SVM range is eventually freed, mmu_interval_notifier_remove() will
also skip the unblock path because the flag was cleared, resulting in a
permanent memory leak and permanently disabling THP for the virtual address
range.
[ ... ]
> @@ -112,10 +133,11 @@ svm_range_add_notifier_locked(struct mm_struct *mm, struct svm_range *prange)
> pr_debug("svms 0x%p prange 0x%p [0x%lx 0x%lx]\n", prange->svms,
> prange, prange->start, prange->last);
>
> - mmu_interval_notifier_insert_locked(&prange->notifier, mm,
> - prange->start << PAGE_SHIFT,
> - prange->npages << PAGE_SHIFT,
> - &svm_range_mn_ops);
> + mmu_interval_notifier_insert_locked_flags(&prange->notifier, mm,
> + prange->start << PAGE_SHIFT,
> + prange->npages << PAGE_SHIFT,
> + &svm_range_mn_ops,
> + svm_range_mn_flags(prange));
> }
[Severity: High]
Does this silently ignore failures during interval notifier registration?
If mmu_interval_notifier_insert_locked_flags() fails, it returns an error
code and prange->notifier.mm remains NULL. Because the return value is
discarded here, subsequent code in svm_range_set_attr() will call
svm_range_update_mn_flags_locked() on the same transaction. This passes the
uninitialized notifier to mmu_interval_notifier_set_flags_locked(), which
will hit a WARN_ON_ONCE(!mm), return -EINVAL, and cause the transaction to
abort, potentially corrupting the interval tree.
[ ... ]
> @@ -3763,6 +3785,12 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
> }
> list_for_each_entry(prange, &update_list, update_list) {
> svm_range_apply_attrs(p, prange, nattr, attrs, &update_mapping);
> + r = svm_range_update_mn_flags_locked(prange);
> + if (r) {
> + mutex_unlock(&svms->lock);
> + mmap_write_unlock(mm);
> + goto out;
> + }
[Severity: High]
Does aborting the transaction here leak memory and corrupt the interval tree?
If svm_range_update_mn_flags_locked() fails, the goto out path completely
bypasses the remove_list processing located further down in the function.
This leaves old cloned ranges permanently in the interval tree alongside
their newly inserted replacements that were added during the insert_list
loop. Additionally, because remove_list is a stack variable, skipping its
teardown leaks the ranges and leaves dangling pointers in the update_list
links.
> /* TODO: unmap ranges from GPU that lost access */
> }
> update_mapping |= !p->xnack_enabled && !list_empty(&remap_list);
[Severity: High]
This is a pre-existing issue, but does svm_range_remove_notifier() fail to
remove the notifier if the SVM range starts at virtual address 0?
In drivers/gpu/drm/amd/amdkfd/kfd_svm.c:svm_range_remove_notifier():
if (prange->notifier.interval_tree.start != 0 &&
prange->notifier.interval_tree.last != 0)
mmu_interval_notifier_remove(&prange->notifier);
If an SVM range is mapped starting at virtual address 0 (which is permitted
if vm.mmap_min_addr is 0 or if the user has CAP_SYS_RAWIO), the check
evaluates to false and incorrectly skips mmu_interval_notifier_remove().
The prange is subsequently freed, but its embedded notifier structure
remains linked in the mm->notifier_subscriptions->itree, which will cause a
use-after-free the next time the tree is modified.
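
A minimal userspace sketch of the quoted condition, assuming the snippet above matches the tree under review. The function name model_would_remove is hypothetical; it only reproduces the boolean test, not the notifier removal itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the quoted check: removal happens only if both ends are non-zero. */
bool model_would_remove(unsigned long start, unsigned long last)
{
	return start != 0 && last != 0;
}

void model_demo(void)
{
	/* A one-page range starting at address 0: removal is skipped. */
	assert(!model_would_remove(0x0UL, 0xfffUL));
	/* The same-sized range one page higher: removal runs. */
	assert(model_would_remove(0x1000UL, 0x1fffUL));
}
```

The check appears to use a zero start as a "never registered" marker, which is why a legitimately registered range at address 0 would be indistinguishable from an unregistered one in this model.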
--
Sashiko AI review · https://sashiko.dev/#/patchset/SY1PR01MB10596EB75463208A8E1EBBA0FC0EC2@SY1PR01MB10596.ausprd01.prod.outlook.com?part=3
* Re: [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP
2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
@ 2026-06-25 11:18 ` sashiko-bot
2026-06-25 11:50 ` David Hildenbrand (Arm)
2026-06-25 11:58 ` Lorenzo Stoakes
2 siblings, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-06-25 11:18 UTC
To: Yitao Jiang; +Cc: dri-devel
Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] Bypass of THP protections via VMA modification and logic bug in hugepage_madvise().
- [High] mmu_interval_notifier_range_block_thp() misses notifiers on the deferred_list, allowing a concurrent MADV_HUGEPAGE to re-enable THP.
- [Medium] MADV_HUGEPAGE requests are silently dropped for valid memory regions due to coarse VMA-level overlap checks.
- [Medium] VMAs with the VM_NO_KHUGEPAGED flag are incorrectly skipped during THP blocking, allowing synchronous THP allocations.
--
commit 70ad5b57ebfe62382bb716c4e65db9bc1f0ee638
Author: Yitao Jiang <jytscientist@hotmail.com>
Subject: mm/mmu_notifier: let interval notifiers block THP
This patch introduces MMU_INTERVAL_NOTIFIER_BLOCK_THP to allow interval
notifiers to request blocking of transparent huge pages on covered VMA ranges.
The MM core handles updating the VMA flags (setting VM_NOHUGEPAGE and clearing
VM_HUGEPAGE) during registration instead of drivers doing it directly.
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 617bca76db49b..a9b05e71686c5 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
[ ... ]
> @@ -445,11 +445,16 @@ static unsigned int collapse_max_ptes_swap(struct collapse_control *cc,
> return khugepaged_max_ptes_swap;
> }
>
> -int hugepage_madvise(struct vm_area_struct *vma,
> -		     vm_flags_t *vm_flags, int advice)
> +int hugepage_madvise(struct vm_area_struct *vma, unsigned long start,
> +		     unsigned long end, vm_flags_t *vm_flags, int advice)
>  {
>  	switch (advice) {
>  	case MADV_HUGEPAGE:
> +		if ((*vm_flags & VM_NOHUGEPAGE) &&
> +		    mmu_interval_notifier_range_block_thp(vma->vm_mm,
> +							  start, end))
> +			return 0;
> +
[Severity: High]
If a new VMA is created within the notifier's range (e.g., via mremap or a
new mmap with MAP_FIXED), it will be created with default flags lacking
VM_NOHUGEPAGE, because mmu_interval_notifier_block_thp_locked() only
statically applies flags to existing VMAs. Because this check short-circuits
when VM_NOHUGEPAGE is absent, it seems it would evaluate to false and allow
MADV_HUGEPAGE to succeed. Does this allow userspace to bypass the THP blocking
protection for newly created VMAs?
[Severity: Medium]
Additionally, if a VMA was previously marked with VM_NOHUGEPAGE before the
notifier was registered, it may not have been split if its flags already
matched. If a user later calls MADV_HUGEPAGE on that entire VMA, this
condition will return 0 and ignore the hint completely. Should this logic
only reject the hint for the specific regions overlapping the notifier,
rather than dropping valid MADV_HUGEPAGE requests for the rest of the VMA?
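The short-circuit described in the first comment above can be modeled in user space as follows. This is only a sketch of the boolean logic in the quoted hunk: `MODEL_VM_NOHUGEPAGE` is an illustrative bit value rather than the kernel's definition, and `hint_ignored()` is a hypothetical name for the condition under which the patched `hugepage_madvise()` returns 0 early.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bit; not the kernel's actual VM_NOHUGEPAGE value. */
#define MODEL_VM_NOHUGEPAGE 0x1UL

/*
 * Models the quoted condition: the MADV_HUGEPAGE hint is ignored only when
 * the VMA already carries the no-hugepage flag AND the range overlaps a
 * notifier that asked for THP blocking.
 */
static bool hint_ignored(unsigned long vm_flags, bool overlaps_blocking_notifier)
{
	return (vm_flags & MODEL_VM_NOHUGEPAGE) && overlaps_blocking_notifier;
}
```

Under this model, a VMA created after registration (flag clear) that overlaps a blocking notifier yields `false`, so the hint is not ignored, which is the bypass the comment asks about.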
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 245b74f39f912..852a5682b07f2 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
[ ... ]
> +bool mmu_interval_notifier_range_block_thp(struct mm_struct *mm,
> +					   unsigned long start,
> +					   unsigned long end)
> +{
> +	struct mmu_notifier_subscriptions *subscriptions;
> +	struct mmu_interval_notifier *interval_sub;
> +	struct interval_tree_node *node;
> +	bool block_thp = false;
> +
> +	if (start >= end)
> +		return false;
> +
> +	/* Pairs with the store in mmu_notifier_register(). */
> +	subscriptions = smp_load_acquire(&mm->notifier_subscriptions);
> +	if (!subscriptions || !subscriptions->has_itree)
> +		return false;
> +
> +	spin_lock(&subscriptions->lock);
> +	for (node = interval_tree_iter_first(&subscriptions->itree, start,
> +					     end - 1);
> +	     node;
> +	     node = interval_tree_iter_next(node, start, end - 1)) {
> +		interval_sub = container_of(node, struct mmu_interval_notifier,
> +					    interval_tree);
> +		if (interval_sub->flags & MMU_INTERVAL_NOTIFIER_BLOCK_THP) {
> +			block_thp = true;
> +			break;
> +		}
> +	}
> +	spin_unlock(&subscriptions->lock);
> +
> +	return block_thp;
> +}
[Severity: High]
When an interval notifier is registered concurrently with an active page
table invalidation, __mmu_interval_notifier_insert() defers tree insertion
by placing the notifier on subscriptions->deferred_list. Since this loop
only inspects subscriptions->itree, could it miss active notifiers on the
deferred list and mistakenly allow a concurrent MADV_HUGEPAGE request to
re-enable THP?
[ ... ]
> +static int
> +mmu_interval_notifier_block_thp_locked(struct mm_struct *mm,
> + unsigned long start,
> + unsigned long end)
> +{
> + struct vm_area_struct *vma, *prev;
> + struct vma_iterator vmi;
> +
> + mmap_assert_write_locked(mm);
> +
> + vma_iter_init(&vmi, mm, start);
> + vma = vma_iter_load(&vmi);
> + prev = vma_prev(&vmi);
> + if (vma && start > vma->vm_start)
> + prev = vma;
> +
> + for_each_vma_range(vmi, vma, end) {
> + const unsigned long curr_start = max(vma->vm_start, start);
> + const unsigned long curr_end = min(vma->vm_end, end);
> + vma_flags_t new_flags;
> +
> + if (vma->vm_flags & VM_NO_KHUGEPAGED)
> + goto next;
[Severity: Medium]
The VM_NO_KHUGEPAGED flag only disables background scanning by khugepaged,
but it does not prevent transparent huge pages from being synchronously
allocated and mapped during page faults. By skipping these VMAs and not
setting VM_NOHUGEPAGE, does this leave them vulnerable to synchronous THP
allocations, thereby breaking the protection required by the notifier?
--
Sashiko AI review · https://sashiko.dev/#/patchset/SY1PR01MB10596EB75463208A8E1EBBA0FC0EC2@SY1PR01MB10596.ausprd01.prod.outlook.com?part=1
* Re: [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers
2026-06-25 10:59 ` [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers Yitao Jiang
@ 2026-06-25 11:26 ` sashiko-bot
2026-06-25 12:36 ` Christian König
1 sibling, 0 replies; 16+ messages in thread
From: sashiko-bot @ 2026-06-25 11:26 UTC (permalink / raw)
To: Yitao Jiang; +Cc: dri-devel
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Unaligned user_addr from KFD_IOC_ALLOC_MEMORY_OF_GPU causes VMA tree corruption via split_vma()
--
commit 84a588c64a29eedb721607110e005505178f21fc
Author: Yitao Jiang <jytscientist@hotmail.com>
drm/amdgpu: block THP for HSA userptr notifiers
This commit modifies the registration of HSA userptr interval notifiers
to use the MMU_INTERVAL_NOTIFIER_BLOCK_THP flag. This prevents CPU THP
remaps of active userptr ranges from causing asynchronous GPU work failures.
GFX userptrs continue to use the existing notifier path.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> index 99bc9ad67d5b8..c0b36164c35e2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
[ ... ]
> @@ -130,16 +131,24 @@ static const struct mmu_interval_notifier_ops amdgpu_hmm_hsa_ops = {
> */
> int amdgpu_hmm_register(struct amdgpu_bo *bo, unsigned long addr)
> {
[ ... ]
> + if (bo->kfd_bo) {
> + mmap_write_lock(mm);
> + r = mmu_interval_notifier_insert_locked_flags(&bo->notifier, mm,
> + addr, size,
> + &amdgpu_hmm_hsa_ops,
> + MMU_INTERVAL_NOTIFIER_BLOCK_THP);
[Severity: Critical]
Can addr be unaligned when passed to this function?
A user space process can call the KFD_IOC_ALLOC_MEMORY_OF_GPU ioctl with the
KFD_IOC_ALLOC_MEM_FLAGS_USERPTR flag and provide an unaligned virtual address
in args->mmap_offset. The ioctl handler in kfd_ioctl_alloc_memory_of_gpu()
(in drivers/gpu/drm/amd/amdkfd/kfd_chardev.c) does not enforce page alignment
on this offset:
drivers/gpu/drm/amd/amdkfd/kfd_chardev.c:kfd_ioctl_alloc_memory_of_gpu() {
	...
	err = amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
		dev->adev, args->va_addr, args->size,
		pdd->drm_priv, (struct kgd_mem **) &mem, &offset,
		flags, false);
	...
}
The unaligned offset propagates through init_user_pages() unmodified as
user_addr, which is eventually passed here as addr to amdgpu_hmm_register().
With the MMU_INTERVAL_NOTIFIER_BLOCK_THP flag added by this patch, the core
MM calls mmu_interval_notifier_block_thp_locked(), which uses this addr as
the start address. It iteratively passes this address to vma_modify_flags()
and subsequently split_vma().
Since VMA boundaries must be strictly page-aligned, split_vma() will corrupt
the VMA tree and break invariants by applying the unaligned address to the new
VMA bounds. This can lead to kernel panics or a local denial-of-service.
Should addr and size be aligned before calling
mmu_interval_notifier_insert_locked_flags(), or should the ioctl enforce
page alignment on user pointers?
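The two options in the question above (reject unaligned input, or widen the range to page boundaries) can be sketched in user-space C. This is an illustration under stated assumptions: `MODEL_PAGE_SIZE` assumes 4 KiB pages, and `is_page_aligned()` and `align_range()` are hypothetical helpers, not kernel or amdgpu functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB pages; the real value is architecture-dependent. */
#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PAGE_MASK (~(MODEL_PAGE_SIZE - 1))

/* Option 1: let the caller reject addresses that are not page-aligned. */
static bool is_page_aligned(unsigned long addr)
{
	return (addr & ~MODEL_PAGE_MASK) == 0;
}

/*
 * Option 2: widen [addr, addr + size) to page boundaries, rounding the start
 * down and the end up so the aligned range still covers every original byte.
 */
static void align_range(unsigned long addr, unsigned long size,
			unsigned long *start, unsigned long *end)
{
	assert(size > 0);
	*start = addr & MODEL_PAGE_MASK;
	*end = (addr + size + MODEL_PAGE_SIZE - 1) & MODEL_PAGE_MASK;
}
```

For example, an address of 0x1234 with a size of 0x1000 is not page-aligned; widening it gives the range [0x1000, 0x3000).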
--
Sashiko AI review · https://sashiko.dev/#/patchset/SY1PR01MB10596EB75463208A8E1EBBA0FC0EC2@SY1PR01MB10596.ausprd01.prod.outlook.com?part=2
* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
` (2 preceding siblings ...)
2026-06-25 10:59 ` [PATCH 3/3] drm/amdkfd: block THP for non-replayable SVM ranges Yitao Jiang
@ 2026-06-25 11:47 ` David Hildenbrand (Arm)
2026-06-25 11:54 ` Lorenzo Stoakes
2026-06-25 12:35 ` Christian König
4 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-25 11:47 UTC (permalink / raw)
To: Yitao Jiang, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Felix Kuehling, Andrew Morton, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
linux-kernel, linux-mm
On 6/25/26 12:59, Yitao Jiang wrote:
> Hi,
>
> This series fixes a THP policy problem I found while debugging
> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> training.
>
> Some AMDGPU/KFD user mappings are registered through interval
> notifiers and cannot safely tolerate the backing VMA changing from base
> pages to a transparent huge page after registration. Userspace can
> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> collapse the range, after the GPU mapping has been registered.
Huh, why? As a memory notifier user, you must be prepared for memory to get
unmapped+remapped at random points in time.
What is the precise problem here? How are you handling THPs at registration time?
Letting arbitrary drivers make THP policies sounds like the very wrong approach.
--
Cheers,
David
* Re: [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP
2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
2026-06-25 11:18 ` sashiko-bot
@ 2026-06-25 11:50 ` David Hildenbrand (Arm)
2026-06-25 11:58 ` Lorenzo Stoakes
2 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-25 11:50 UTC (permalink / raw)
To: Yitao Jiang, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Felix Kuehling, Andrew Morton, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
linux-kernel, linux-mm
On 6/25/26 12:59, Yitao Jiang wrote:
> Some secondary MMUs cannot safely tolerate a user VMA becoming backed
> by transparent huge pages after the range has been registered with an
> interval notifier. Drivers can observe the page-table layout change
> through invalidations, but devices without replayable faults, or ranges
> that must stay mapped,
Then you shouldn't be using MMU notifiers.
Use good old nasty page pinning. :)
--
Cheers,
David
* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
2026-06-25 11:47 ` [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings David Hildenbrand (Arm)
@ 2026-06-25 11:54 ` Lorenzo Stoakes
2026-06-25 12:14 ` Re: " 蒋 亦韬
0 siblings, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes @ 2026-06-25 11:54 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Yitao Jiang, Alex Deucher, Christian König, David Airlie,
Simona Vetter, Felix Kuehling, Andrew Morton, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
Michal Hocko, Jann Horn, amd-gfx, dri-devel, linux-kernel,
linux-mm
NAK to this or any version of this.
This series is insane and the idea is insane.
On Thu, Jun 25, 2026 at 01:47:25PM +0200, David Hildenbrand (Arm) wrote:
> On 6/25/26 12:59, Yitao Jiang wrote:
> > Hi,
> >
> > This series fixes a THP policy problem I found while debugging
> > frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> > training.
> >
> > Some AMDGPU/KFD user mappings are registered through interval
> > notifiers and cannot safely tolerate the backing VMA changing from base
> > pages to a transparent huge page after registration. Userspace can
> > still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> > collapse the range, after the GPU mapping has been registered.
>
> Huh, why? As a memory notifier user, you must be prepared for memory to get
> unmapped+remapped at random points in time.
>
> What is the precise problem here? How are you handling THPs at registration time?
>
> Letting arbitrary drivers make THP policies sounds like the very wrong approach.
We absolutely will not _ever_ allow drivers to do this while I still breathe :)
>
> --
> Cheers,
>
> David
Thanks, Lorenzo
* Re: [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP
2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
2026-06-25 11:18 ` sashiko-bot
2026-06-25 11:50 ` David Hildenbrand (Arm)
@ 2026-06-25 11:58 ` Lorenzo Stoakes
2 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2026-06-25 11:58 UTC (permalink / raw)
To: Yitao Jiang
Cc: Alex Deucher, Christian König, David Airlie, Simona Vetter,
Felix Kuehling, Andrew Morton, David Hildenbrand, Zi Yan,
Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
linux-kernel, linux-mm
On Thu, Jun 25, 2026 at 06:59:51PM +0800, Yitao Jiang wrote:
> Assisted-by: OpenAI-Codex:GPT-5.5
Thanks for acking AI involvement, that's appreciated (and there appears to be a
fair bit of unacknowledged AI-generated code being submitted at the moment).
However, may I gently direct you towards the last few paragraphs in this document:
https://origin.kernel.org/doc/html/latest/process/generated-content.html
I think this MAY be a case of the AI possibly misleading you into a crazy idea
when, as David points out, page pinning is what you need :)
We don't bite, if you have a problem that you need to solve, feel free to email
linux-mm@kvack.org and relevant maintainers/reviewers from MAINTAINERS with a
'[DISCUSSION]' or '[QUESTION]'-prefixed thread and we can help you ahead of
time.
Thanks, Lorenzo
* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
2026-06-25 11:54 ` Lorenzo Stoakes
@ 2026-06-25 12:14 ` 蒋 亦韬
0 siblings, 0 replies; 16+ messages in thread
From: 蒋 亦韬 @ 2026-06-25 12:14 UTC (permalink / raw)
To: Lorenzo Stoakes, David Hildenbrand (Arm)
Cc: Alex Deucher, Christian König, David Airlie, Simona Vetter,
Felix Kuehling, Andrew Morton, Zi Yan, Baolin Wang,
Liam R . Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Lance Yang, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan,
Michal Hocko, Jann Horn, amd-gfx@lists.freedesktop.org,
dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org
Hi David, Lorenzo,
Thank you for the patient and direct feedback.
You are right; I misjudged the scope and abstraction here. My initial local fix was in the AMD driver path and addressed the failure I was seeing there. I then tried to move the solution into MM core because I guessed similar notifier users might hit the same class of problem. David's explanation makes clear that this was the wrong model: an MMU notifier user must tolerate unmap/remap, and mappings that cannot tolerate that need a different mechanism, such as page pinning, not a driver-controlled THP policy in MM core.
Sorry for the noise and for taking reviewer time. I appreciate the explanation, since it corrected my understanding of the expected MMU notifier and THP semantics.
On the AI assistance: I disclosed it because it was involved, and I did review the generated code against the behavior I thought I wanted. The failure here was my own misunderstanding of the MM core contract, which led to an inappropriate patch despite that review.
I will drop this series and will not send a v2 for this approach. I will re-scope the work to the AMDGPU/KFD side, with a minimal reproducer and a discussion/question first if MM input is needed, rather than proposing MM core changes.
Thanks again,
Yitao
________________________________
From: Lorenzo Stoakes <ljs@kernel.org>
Sent: June 25, 2026 7:54
To: David Hildenbrand (Arm) <david@kernel.org>
Cc: Yitao Jiang <jytscientist@hotmail.com>; Alex Deucher <alexander.deucher@amd.com>; Christian König <christian.koenig@amd.com>; David Airlie <airlied@gmail.com>; Simona Vetter <simona@ffwll.ch>; Felix Kuehling <Felix.Kuehling@amd.com>; Andrew Morton <akpm@linux-foundation.org>; Zi Yan <ziy@nvidia.com>; Baolin Wang <baolin.wang@linux.alibaba.com>; Liam R . Howlett <liam@infradead.org>; Nico Pache <npache@redhat.com>; Ryan Roberts <ryan.roberts@arm.com>; Dev Jain <dev.jain@arm.com>; Barry Song <baohua@kernel.org>; Lance Yang <lance.yang@linux.dev>; Vlastimil Babka <vbabka@kernel.org>; Mike Rapoport <rppt@kernel.org>; Suren Baghdasaryan <surenb@google.com>; Michal Hocko <mhocko@suse.com>; Jann Horn <jannh@google.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; linux-mm@kvack.org <linux-mm@kvack.org>
Subject: Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
NAK to this or any version of this.
This series is insane and the idea is insane.
On Thu, Jun 25, 2026 at 01:47:25PM +0200, David Hildenbrand (Arm) wrote:
> On 6/25/26 12:59, Yitao Jiang wrote:
> > Hi,
> >
> > This series fixes a THP policy problem I found while debugging
> > frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> > training.
> >
> > Some AMDGPU/KFD user mappings are registered through interval
> > notifiers and cannot safely tolerate the backing VMA changing from base
> > pages to a transparent huge page after registration. Userspace can
> > still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> > collapse the range, after the GPU mapping has been registered.
>
> Huh, why? As a memory notifier user, you must be prepared for memory to get
> unmapped+remapped at random points in time.
>
> What is the precise problem here? How are you handling THPs at registration time?
>
> Letting arbitrary drivers make THP policies sounds like the very wrong approach.
We absolutely will not _ever_ allow drivers to do this while I still breathe :)
>
> --
> Cheers,
>
> David
Thanks, Lorenzo
* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
` (3 preceding siblings ...)
2026-06-25 11:47 ` [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings David Hildenbrand (Arm)
@ 2026-06-25 12:35 ` Christian König
2026-06-25 13:01 ` Re: " 蒋 亦韬
4 siblings, 1 reply; 16+ messages in thread
From: Christian König @ 2026-06-25 12:35 UTC (permalink / raw)
To: Yitao Jiang, Alex Deucher, David Airlie, Simona Vetter,
Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
linux-kernel, linux-mm
On 6/25/26 12:59, Yitao Jiang wrote:
> Hi,
>
> This series fixes a THP policy problem I found while debugging
> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> training.
>
> Some AMDGPU/KFD user mappings are registered through interval
> notifiers and cannot safely tolerate the backing VMA changing from base
> pages to a transparent huge page after registration.
That's certainly not correct. This is a must-have for a whole lot of use cases.
Why exactly isn't that working for your use case?
Regards,
Christian.
> Userspace can
> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> collapse the range, after the GPU mapping has been registered.
>
> On my system this showed up as asynchronous ROCm/HIP kernel launch
> failures, often reported later at a synchronization or copy point. I
> expect the issue to be relevant to AMDGPU/KFD mappings on
> XNACK-disabled GPUs more generally, because those mappings cannot rely
> on replayable GPU faults after a CPU-side THP remap. I have validated
> the failure and fix on AMD Radeon 780M / gfx1103.
>
> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
> users can ask the MM core to keep the covered VMA range out of THP
> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
> over an active opt-in range is treated as an ignored hint, and
> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
>
> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
> current behavior.
>
> This does not disable THP globally and does not add work to GPU
> command submission or kernel launch paths. Additional work is limited
> to opt-in notifier registration, opt-in notifier flag transitions, and
> MADV_HUGEPAGE attempts that overlap an active opt-in range.
>
> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
>
> - scripts/checkpatch.pl --strict --no-tree
> - git apply --check
> - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
> DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
> - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
> originally exposed the failure on my Radeon 780M system
>
> The standalone reproducers depend on ROCm userspace libraries, so I
> have not included them in this series. I can send them separately if
> useful.
>
> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
> I reviewed the resulting code and take responsibility for the
> submission.
>
> Yitao Jiang (3):
> mm/mmu_notifier: let interval notifiers block THP
> drm/amdgpu: block THP for HSA userptr notifiers
> drm/amdkfd: block THP for non-replayable SVM ranges
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 25 ++-
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 36 ++++-
> include/linux/huge_mm.h | 5 +-
> include/linux/mmu_notifier.h | 28 ++++
> mm/khugepaged.c | 9 +-
> mm/madvise.c | 3 +-
> mm/mmu_notifier.c | 204 +++++++++++++++++++++++-
> 7 files changed, 286 insertions(+), 24 deletions(-)
>
>
> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
> --
> 2.53.0
* Re: [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers
2026-06-25 10:59 ` [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers Yitao Jiang
2026-06-25 11:26 ` sashiko-bot
@ 2026-06-25 12:36 ` Christian König
1 sibling, 0 replies; 16+ messages in thread
From: Christian König @ 2026-06-25 12:36 UTC (permalink / raw)
To: Yitao Jiang, Alex Deucher, David Airlie, Simona Vetter,
Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
linux-kernel, linux-mm
On 6/25/26 12:59, Yitao Jiang wrote:
> HSA userptr buffer objects are used by KFD compute queues. On systems
> where the GPU cannot reliably tolerate a CPU THP remap of an active
> userptr range, allowing khugepaged or MADV_COLLAPSE to replace PTE
> mappings with a PMD mapping can leave later GPU work failing
> asynchronously.
Absolutely clear NAK to this.
That largely sounds like it just works around some issue and is not really a viable fix.
Regards,
Christian.
>
> Register HSA userptr interval notifiers with
> MMU_INTERVAL_NOTIFIER_BLOCK_THP. GFX userptrs keep the existing
> notifier path and do not opt in.
>
> Assisted-by: OpenAI-Codex:GPT-5.5
> Signed-off-by: Yitao Jiang <jytscientist@hotmail.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 25 +++++++++++++++++--------
> 1 file changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> index 99bc9ad67..c0b36164c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> @@ -44,6 +44,7 @@
> */
>
> #include <linux/firmware.h>
> +#include <linux/mm.h>
> #include <linux/module.h>
> #include <drm/drm.h>
>
> @@ -130,16 +131,24 @@ static const struct mmu_interval_notifier_ops amdgpu_hmm_hsa_ops = {
> */
> int amdgpu_hmm_register(struct amdgpu_bo *bo, unsigned long addr)
> {
> + struct mm_struct *mm = current->mm;
> + unsigned long size = amdgpu_bo_size(bo);
> int r;
>
> - if (bo->kfd_bo)
> - r = mmu_interval_notifier_insert(&bo->notifier, current->mm,
> - addr, amdgpu_bo_size(bo),
> - &amdgpu_hmm_hsa_ops);
> - else
> - r = mmu_interval_notifier_insert(&bo->notifier, current->mm, addr,
> - amdgpu_bo_size(bo),
> - &amdgpu_hmm_gfx_ops);
> + if (unlikely(!mm))
> + return -ESRCH;
> +
> + if (bo->kfd_bo) {
> + mmap_write_lock(mm);
> + r = mmu_interval_notifier_insert_locked_flags(&bo->notifier, mm,
> + addr, size,
> + &amdgpu_hmm_hsa_ops,
> + MMU_INTERVAL_NOTIFIER_BLOCK_THP);
> + mmap_write_unlock(mm);
> + } else {
> + r = mmu_interval_notifier_insert(&bo->notifier, mm, addr, size,
> + &amdgpu_hmm_gfx_ops);
> + }
> if (r)
> /*
> * Make sure amdgpu_hmm_unregister() doesn't call
> --
> 2.53.0
>
* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
2026-06-25 12:35 ` Christian König
@ 2026-06-25 13:01 ` 蒋 亦韬
2026-06-25 13:06 ` Christian König
0 siblings, 1 reply; 16+ messages in thread
From: 蒋 亦韬 @ 2026-06-25 13:01 UTC (permalink / raw)
To: Christian König, Alex Deucher, David Airlie, Simona Vetter,
Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Hi Christian,
I agree that my previous approach was wrong. Sorry about that. Please let me clarify the problem I was seeing and how I ended up with that incorrect conclusion.
The original problem was not a synthetic THP test. I was running ROCm/PyTorch ML training on an AMD Radeon 780M system, and the workload frequently failed with asynchronous HIP kernel launch failures. The userspace error usually surfaced later in PyTorch, for example around a copy/to_device/SetDevice path, but the kernel log showed GPU resets and KFD/MES queue eviction failures.
The relevant kernel messages I repeatedly saw were along these lines:
MES failed to respond to msg=REMOVE_QUEUE
MES failed to respond to msg=SUSPEND
failed to suspend all gangs
failed to remove hardware queue from MES
Failed to evict queue
Failed to evict process queues
GPU reset begin
While trying to reduce the issue, I saw memory invalidations and THP-related page-table/backing-page activity driving the AMDGPU/KFD path through SVM eviction. On this system, the path I was looking at was roughly:
svm_range_cpu_invalidate_pagetables()
-> svm_range_evict()
-> kgd2kfd_quiesce_mm()
-> KFD process queue eviction
-> MES REMOVE_QUEUE / SUSPEND
One thing that misled me was the XNACK-disabled path. Since the issue appeared on an XNACK-disabled APU, and that path requires queue eviction/quiesce when CPU page table invalidations affect GPU mappings, I incorrectly thought the backing-page change itself was something the driver had to prevent.
Another thing that misled me was that the application was not intentionally asking for THP behavior. From the workload’s point of view, these page transitions looked unrelated to the model computation. I therefore incorrectly assumed that userspace should not be able to change backing-page characteristics in a way that affects a driver mapping already registered with MMU interval notifiers. I now understand from the MM feedback that this is expected behavior, and that the notifier user must handle unmap/remap correctly.
So the more precise problem is that THP/remap is only one way to trigger the invalidation path. What is failing for my workload is the AMDGPU/KFD/MES queue quiesce/eviction path during those invalidations. When that fails, the GPU resets, and userspace later observes an asynchronous HIP failure.
Please allow me to continue investigating a more appropriate fix for this problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid changing MM-core or THP policy semantics.
Regards,
Yitao
________________________________
From: Christian König <christian.koenig@amd.com>
Sent: June 25, 2026 8:35
To: Yitao Jiang <jytscientist@hotmail.com>; Alex Deucher <alexander.deucher@amd.com>; David Airlie <airlied@gmail.com>; Simona Vetter <simona@ffwll.ch>; Felix Kuehling <Felix.Kuehling@amd.com>; Andrew Morton <akpm@linux-foundation.org>; David Hildenbrand <david@kernel.org>; Lorenzo Stoakes <ljs@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>; Baolin Wang <baolin.wang@linux.alibaba.com>; Liam R . Howlett <liam@infradead.org>; Nico Pache <npache@redhat.com>; Ryan Roberts <ryan.roberts@arm.com>; Dev Jain <dev.jain@arm.com>; Barry Song <baohua@kernel.org>; Lance Yang <lance.yang@linux.dev>; Vlastimil Babka <vbabka@kernel.org>; Mike Rapoport <rppt@kernel.org>; Suren Baghdasaryan <surenb@google.com>; Michal Hocko <mhocko@suse.com>; Jann Horn <jannh@google.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; linux-mm@kvack.org <linux-mm@kvack.org>
Subject: Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
On 6/25/26 12:59, Yitao Jiang wrote:
> Hi,
>
> This series fixes a THP policy problem I found while debugging
> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> training.
>
> Some AMDGPU/KFD user mappings are registered through interval
> notifiers and cannot safely tolerate the backing VMA changing from base
> pages to a transparent huge page after registration.
That's certainly not correct. This is a must-have for a whole lot of use cases.
Why exactly isn't that working for your use case?
Regards,
Christian.
> Userspace can
> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> collapse the range, after the GPU mapping has been registered.
>
> On my system this showed up as asynchronous ROCm/HIP kernel launch
> failures, often reported later at a synchronization or copy point. I
> expect the issue to be relevant to AMDGPU/KFD mappings on
> XNACK-disabled GPUs more generally, because those mappings cannot rely
> on replayable GPU faults after a CPU-side THP remap. I have validated
> the failure and fix on AMD Radeon 780M / gfx1103.
>
> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
> users can ask the MM core to keep the covered VMA range out of THP
> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
> over an active opt-in range is treated as an ignored hint, and
> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
>
> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
> current behavior.
>
> This does not disable THP globally and does not add work to GPU
> command submission or kernel launch paths. Additional work is limited
> to opt-in notifier registration, opt-in notifier flag transitions, and
> MADV_HUGEPAGE attempts that overlap an active opt-in range.
>
> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
>
> - scripts/checkpatch.pl --strict --no-tree
> - git apply --check
> - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
> DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
> - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
> originally exposed the failure on my Radeon 780M system
>
> The standalone reproducers depend on ROCm userspace libraries, so I
> have not included them in this series. I can send them separately if
> useful.
>
> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
> I reviewed the resulting code and take responsibility for the
> submission.
>
> Yitao Jiang (3):
> mm/mmu_notifier: let interval notifiers block THP
> drm/amdgpu: block THP for HSA userptr notifiers
> drm/amdkfd: block THP for non-replayable SVM ranges
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 25 ++-
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 36 ++++-
> include/linux/huge_mm.h | 5 +-
> include/linux/mmu_notifier.h | 28 ++++
> mm/khugepaged.c | 9 +-
> mm/madvise.c | 3 +-
> mm/mmu_notifier.c | 204 +++++++++++++++++++++++-
> 7 files changed, 286 insertions(+), 24 deletions(-)
>
>
> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
> --
> 2.53.0
* Re: Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
2026-06-25 13:01 ` Re: " 蒋 亦韬
@ 2026-06-25 13:06 ` Christian König
0 siblings, 0 replies; 16+ messages in thread
From: Christian König @ 2026-06-25 13:06 UTC (permalink / raw)
To: 蒋 亦韬, Alex Deucher, David Airlie,
Simona Vetter, Felix Kuehling, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Yang, Philip
Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Jann Horn,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Hi Yitao,
Adding Philip Yang.
Thanks for the investigation. That sounds like some kind of bug in the KFD SVM handling; the driver should be perfectly capable of handling this.
I strongly suggest opening a bug report for ROCm and describing how to reproduce this. Philip can probably point you to the right location for that.
Regards,
Christian.
On 6/25/26 15:01, 蒋 亦韬 wrote:
> Hi Christian,
>
> I agree that my previous approach was wrong. Sorry about that. Please let me clarify the problem I was seeing and how I ended up with that incorrect conclusion.
>
> The original problem was not a synthetic THP test. I was running ROCm/PyTorch ML training on an AMD Radeon 780M system, and the workload frequently failed with asynchronous HIP kernel launch failures. The userspace error usually surfaced later in PyTorch, for example around a copy/to_device/SetDevice path, but the kernel log showed GPU resets and KFD/MES queue eviction failures.
>
> The relevant kernel messages I repeatedly saw were along these lines:
>
>     MES failed to respond to msg=REMOVE_QUEUE
>     MES failed to respond to msg=SUSPEND
>     failed to suspend all gangs
>     failed to remove hardware queue from MES
>     Failed to evict queue
>     Failed to evict process queues
>     GPU reset begin
>
> While trying to reduce the issue, I saw memory invalidations and THP-related page-table/backing-page activity driving the AMDGPU/KFD path through SVM eviction. On this system, the path I was looking at was roughly:
>
>     svm_range_cpu_invalidate_pagetables()
>       -> svm_range_evict()
>       -> kgd2kfd_quiesce_mm()
>       -> KFD process queue eviction
>       -> MES REMOVE_QUEUE / SUSPEND
>
> One thing that misled me was the XNACK-disabled path. Since the issue appeared on an XNACK-disabled APU, and that path requires queue eviction/quiesce when CPU page table invalidations affect GPU mappings, I incorrectly thought the backing-page change itself was something the driver had to prevent.
>
> Another thing that misled me was that the application was not intentionally asking for THP behavior. From the workload’s point of view, these page transitions looked unrelated to the model computation. I therefore incorrectly assumed that userspace should not be able to change backing-page characteristics in a way that affects a driver mapping already registered with MMU interval notifiers. I now understand from the MM feedback that this is expected behavior, and that the notifier user must handle unmap/remap correctly.
>
> So the more precise problem is that THP/remap is only one way to trigger the invalidation path. What is failing for my workload is the AMDGPU/KFD/MES queue quiesce/eviction path during those invalidations. When that fails, the GPU resets, and userspace later observes an asynchronous HIP failure.
>
> Please allow me to continue investigating a more appropriate fix for this problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid changing MM-core or THP policy semantics.
>
> Regards,
> Yitao
> ------------------------------------------------------------------------
> *From:* Christian König <christian.koenig@amd.com>
> *Sent:* June 25, 2026 8:35
> *To:* Yitao Jiang <jytscientist@hotmail.com>; Alex Deucher <alexander.deucher@amd.com>; David Airlie <airlied@gmail.com>; Simona Vetter <simona@ffwll.ch>; Felix Kuehling <Felix.Kuehling@amd.com>; Andrew Morton <akpm@linux-foundation.org>; David Hildenbrand <david@kernel.org>; Lorenzo Stoakes <ljs@kernel.org>
> *Cc:* Zi Yan <ziy@nvidia.com>; Baolin Wang <baolin.wang@linux.alibaba.com>; Liam R . Howlett <liam@infradead.org>; Nico Pache <npache@redhat.com>; Ryan Roberts <ryan.roberts@arm.com>; Dev Jain <dev.jain@arm.com>; Barry Song <baohua@kernel.org>; Lance Yang <lance.yang@linux.dev>; Vlastimil Babka <vbabka@kernel.org>; Mike Rapoport <rppt@kernel.org>; Suren Baghdasaryan <surenb@google.com>; Michal Hocko <mhocko@suse.com>; Jann Horn <jannh@google.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>; linux-mm@kvack.org <linux-mm@kvack.org>
> *Subject:* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
>
> On 6/25/26 12:59, Yitao Jiang wrote:
>> Hi,
>>
>> This series fixes a THP policy problem I found while debugging
>> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
>> training.
>>
>> Some AMDGPU/KFD user mappings are registered through interval
>> notifiers and cannot safely tolerate the backing VMA changing from base
>> pages to a transparent huge page after registration.
>
> That's certainly not correct. This is a must-have for a whole lot of use cases.
>
> Why exactly isn't that working for your use case?
>
> Regards,
> Christian.
>
>> Userspace can
>> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
>> collapse the range, after the GPU mapping has been registered.
>>
>> On my system this showed up as asynchronous ROCm/HIP kernel launch
>> failures, often reported later at a synchronization or copy point. I
>> expect the issue to be relevant to AMDGPU/KFD mappings on
>> XNACK-disabled GPUs more generally, because those mappings cannot rely
>> on replayable GPU faults after a CPU-side THP remap. I have validated
>> the failure and fix on AMD Radeon 780M / gfx1103.
>>
>> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
>> users can ask the MM core to keep the covered VMA range out of THP
>> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
>> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
>> over an active opt-in range is treated as an ignored hint, and
>> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
>>
>> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
>> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
>> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
>> current behavior.
>>
>> This does not disable THP globally and does not add work to GPU
>> command submission or kernel launch paths. Additional work is limited
>> to opt-in notifier registration, opt-in notifier flag transitions, and
>> MADV_HUGEPAGE attempts that overlap an active opt-in range.
>>
>> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
>>
>> - scripts/checkpatch.pl --strict --no-tree
>> - git apply --check
>> - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
>> DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
>> - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
>> originally exposed the failure on my Radeon 780M system
>>
>> The standalone reproducers depend on ROCm userspace libraries, so I
>> have not included them in this series. I can send them separately if
>> useful.
>>
>> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
>> I reviewed the resulting code and take responsibility for the
>> submission.
>>
>> Yitao Jiang (3):
>> mm/mmu_notifier: let interval notifiers block THP
>> drm/amdgpu: block THP for HSA userptr notifiers
>> drm/amdkfd: block THP for non-replayable SVM ranges
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 25 ++-
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 36 ++++-
>> include/linux/huge_mm.h | 5 +-
>> include/linux/mmu_notifier.h | 28 ++++
>> mm/khugepaged.c | 9 +-
>> mm/madvise.c | 3 +-
>> mm/mmu_notifier.c | 204 +++++++++++++++++++++++-
>> 7 files changed, 286 insertions(+), 24 deletions(-)
>>
>>
>> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
>> --
>> 2.53.0
>
Thread overview: 16+ messages
2026-06-25 10:59 [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings Yitao Jiang
2026-06-25 10:59 ` [PATCH 1/3] mm/mmu_notifier: let interval notifiers block THP Yitao Jiang
2026-06-25 11:18 ` sashiko-bot
2026-06-25 11:50 ` David Hildenbrand (Arm)
2026-06-25 11:58 ` Lorenzo Stoakes
2026-06-25 10:59 ` [PATCH 2/3] drm/amdgpu: block THP for HSA userptr notifiers Yitao Jiang
2026-06-25 11:26 ` sashiko-bot
2026-06-25 12:36 ` Christian König
2026-06-25 10:59 ` [PATCH 3/3] drm/amdkfd: block THP for non-replayable SVM ranges Yitao Jiang
2026-06-25 11:11 ` sashiko-bot
2026-06-25 11:47 ` [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings David Hildenbrand (Arm)
2026-06-25 11:54 ` Lorenzo Stoakes
2026-06-25 12:14 ` Re: " 蒋 亦韬
2026-06-25 12:35 ` Christian König
2026-06-25 13:01 ` Re: " 蒋 亦韬
2026-06-25 13:06 ` Christian König