* [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
From: Yitao Jiang @ 2026-06-25 10:59 UTC
  To: Alex Deucher, Christian König, David Airlie, Simona Vetter,
	Felix Kuehling, Andrew Morton, David Hildenbrand, Lorenzo Stoakes
  Cc: Zi Yan, Baolin Wang, Liam R . Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Lance Yang, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, amd-gfx, dri-devel,
	linux-kernel, linux-mm, Yitao Jiang

Hi,

This series fixes a THP policy problem I found while debugging
frequent ROCm GPU failures on an AMD Radeon 780M system during ML
training.

Some AMDGPU/KFD user mappings are registered through interval
notifiers and cannot safely tolerate the backing VMA changing from base
pages to a transparent huge page after registration. Userspace can
still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
collapse the range, after the GPU mapping has been registered.
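
For illustration only, the userspace side of that sequence can be
sketched as below. This is not one of the reproducers from this series.
The helper name request_thp() is hypothetical, and the fallback values
for MADV_HUGEPAGE and MADV_COLLAPSE are assumed from the generic Linux
uapi headers for libc versions that do not define them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers; values assumed from the generic Linux uapi. */
#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14
#endif
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/*
 * Hypothetical helper: request THP on [addr, addr + len) the way
 * userspace could after a GPU mapping of the same range has been
 * registered. Returns 0 if both calls succeed, otherwise the negative
 * errno of the first call that failed.
 */
static int request_thp(void *addr, size_t len)
{
	/* Hint that khugepaged and the fault path may use huge pages here. */
	if (madvise(addr, len, MADV_HUGEPAGE) != 0)
		return -errno;

	/* Ask for a synchronous, best-effort collapse of the range. */
	if (madvise(addr, len, MADV_COLLAPSE) != 0)
		return -errno;

	return 0;
}
```

Either call can fail for reasons unrelated to this series (for example
a kernel built without THP support), so a caller should treat a
negative return as "no THP here" rather than as a hard error.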

On my system this showed up as asynchronous ROCm/HIP kernel launch
failures, often reported later at a synchronization or copy point. I
expect the issue to be relevant to AMDGPU/KFD mappings on
XNACK-disabled GPUs more generally, because those mappings cannot rely
on replayable GPU faults after a CPU-side THP remap. I have validated
the failure and fix on AMD Radeon 780M / gfx1103.

Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
users can ask the MM core to keep the covered VMA range out of THP
while the notifier is active. The MM core applies VM_NOHUGEPAGE and
clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
over an active opt-in range is treated as an ignored hint, and
MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
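
The flag policy described above can be modelled in plain C as follows.
This is a simplified userspace model, not the code from patch 1: the
struct, the function names and the flag bit values are all hypothetical
stand-ins. It also assumes that the ignored MADV_HUGEPAGE hint reports
success and that the rejected MADV_COLLAPSE reports -EINVAL.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in flag bits for this model only; not the kernel's vm_flags values. */
#define MODEL_VM_HUGEPAGE	(1u << 0)
#define MODEL_VM_NOHUGEPAGE	(1u << 1)

/* Hypothetical, simplified stand-in for a VMA covered by a notifier. */
struct model_vma {
	unsigned int flags;
	bool block_thp;		/* an opt-in interval notifier is active */
};

/* Opt-in registration: set NOHUGEPAGE and clear HUGEPAGE. */
static void model_register_block_thp(struct model_vma *vma)
{
	vma->block_thp = true;
	vma->flags |= MODEL_VM_NOHUGEPAGE;
	vma->flags &= ~MODEL_VM_HUGEPAGE;
}

/* Later MADV_HUGEPAGE: an ignored hint while the opt-in range is active. */
static int model_madv_hugepage(struct model_vma *vma)
{
	if (vma->block_thp)
		return 0;	/* assumed: succeed without touching flags */

	vma->flags |= MODEL_VM_HUGEPAGE;
	vma->flags &= ~MODEL_VM_NOHUGEPAGE;
	return 0;
}

/* MADV_COLLAPSE: rejected whenever NOHUGEPAGE is set on the range. */
static int model_madv_collapse(const struct model_vma *vma)
{
	return (vma->flags & MODEL_VM_NOHUGEPAGE) ? -EINVAL : 0;
}
```

The model leaves out locking entirely; in the series the flag update
happens under mmap_lock held for write.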

Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
current behavior.
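
Read as a predicate, the opt-in rule above looks roughly like the
sketch below. The enum, the function name and the parameters are
hypothetical and only restate the three cases listed here; the real
checks live in the driver patches and may be structured differently.

```c
#include <stdbool.h>

/* Hypothetical classification of interval notifier users. */
enum model_mapping_kind {
	MODEL_HSA_USERPTR_BO,
	MODEL_KFD_SVM_RANGE,
	MODEL_OTHER_NOTIFIER_USER,
};

/* Returns true when the mapping would opt in to blocking THP. */
static bool model_wants_block_thp(enum model_mapping_kind kind,
				  bool xnack_enabled, bool always_mapped)
{
	switch (kind) {
	case MODEL_HSA_USERPTR_BO:
		return true;
	case MODEL_KFD_SVM_RANGE:
		/* No replayable faults, or the range must stay mapped. */
		return !xnack_enabled || always_mapped;
	default:
		return false;	/* other users keep current behavior */
	}
}
```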

This does not disable THP globally and does not add work to GPU
command submission or kernel launch paths. Additional work is limited
to opt-in notifier registration, opt-in notifier flag transitions, and
MADV_HUGEPAGE attempts that overlap an active opt-in range.

I tested this on top of torvalds/linux commit ab9de95c9cf9 with:

  - scripts/checkpatch.pl --strict --no-tree
  - git apply --check
  - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
    DRM_AMDGPU=m, and HSA_AMD=y, building the mm/ and AMDGPU/KFD
    objects
  - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
    originally exposed the failure on my Radeon 780M system

The standalone reproducers depend on ROCm userspace libraries, so I
have not included them in this series. I can send them separately if
useful.

This series was prepared with assistance from OpenAI Codex (GPT-5.5).
I reviewed the resulting code and take responsibility for the
submission.

Yitao Jiang (3):
  mm/mmu_notifier: let interval notifiers block THP
  drm/amdgpu: block THP for HSA userptr notifiers
  drm/amdkfd: block THP for non-replayable SVM ranges

 drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
 include/linux/huge_mm.h                 |   5 +-
 include/linux/mmu_notifier.h            |  28 ++++
 mm/khugepaged.c                         |   9 +-
 mm/madvise.c                            |   3 +-
 mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
 7 files changed, 286 insertions(+), 24 deletions(-)


base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
-- 
2.53.0


