From: Honglei Huang <honghuan@amd.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
Alexander.Deucher@amd.com, Felix.Kuehling@amd.com,
"Honglei Huang" <honglei1.huang@amd.com>,
Oak.Zeng@amd.com, Jenny-Jing.Liu@amd.com, Philip.Yang@amd.com,
Xiaogang.Chen@amd.com, Ray.Huang@amd.com, Lingshan.Zhu@amd.com,
Junhua.Shen@amd.com, "Matthew Brost" <matthew.brost@intel.com>,
"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"Alice Ryhl" <aliceryhl@google.com>
Subject: Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
Date: Wed, 18 Mar 2026 16:59:31 +0800
Message-ID: <ae5fe946-4756-43b4-848f-3b545ac61ba7@amd.com>
In-Reply-To: <e21e8e1a-4d2e-40e9-bbb7-2764cf33e760@amd.com>
On 3/17/26 19:48, Christian König wrote:
> Adding a few XE and drm_gpuvm people on TO.
>
> On 3/17/26 12:29, Honglei Huang wrote:
>> From: Honglei Huang <honghuan@amd.com>
>>
>> This is a POC/draft patch series of SVM feature in amdgpu based on the
>> drm_gpusvm framework. The primary purpose of this RFC is to validate
>> the framework's applicability, identify implementation challenges,
>> and start discussion on framework evolution. This is not a production
>> ready submission.
>>
>> This patch series implements basic SVM support with the following features:
>>
>> 1. Attributes separated from physical page management:
>>
>> - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
>> tree that stores SVM attributes, managed through the SET_ATTR
>> operation and the MMU notifier callback.
>>
>> - Physical page layer (drm_gpusvm ranges): managed by the
>> drm_gpusvm framework, representing actual HMM backed DMA
>> mappings and GPU page table entries.
>>
>> This separation is necessary:
>> - The framework does not support range splitting, so a partial
>> munmap destroys the entire overlapping range, including the
>> still valid parts. If attributes were stored inside drm_gpusvm
>> ranges, they would be lost on unmapping. The separate attr
>> tree preserves userspace-set attributes across range
>> operations.
>
> Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
It is about partial unmap: part of a drm_gpusvm_range may still be
valid while another part becomes invalid, yet under drm_gpusvm the
entire range has to be destroyed.
e.g.:
[---------------unmap region in mmu notifier-----------------]
[0x1000 ------------ 0x9000]
[ valid ][ invalid ]
See the details in drm_gpusvm.c around line 110,
section "Partial Unmapping of Ranges".
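The behavior above can be sketched as a toy model (illustrative only, not
actual drm_gpusvm code; the struct and function names are made up): a partial
unmap tears down the whole GPU range, and the still valid remainder has to be
re-inserted and re-mapped by the driver afterwards.

```c
#include <assert.h>

/* Hypothetical model: a range with one mapped state for the whole span. */
struct range { unsigned long start, end; int mapped; };

/* A partial unmap of [u_start, u_end) overlapping the range destroys the
 * entire GPU mapping; only the CPU-valid remainder is returned, and the
 * caller (e.g. a GC/restore worker) must rebuild it later. */
static struct range partial_unmap(struct range r,
				  unsigned long u_start, unsigned long u_end)
{
	struct range valid = { 0, 0, 0 };

	r.mapped = 0;			/* the whole GPU range is torn down */
	if (u_start > r.start)		/* head remainder stays CPU-valid */
		valid = (struct range){ r.start, u_start, 0 };
	else if (u_end < r.end)		/* tail remainder stays CPU-valid */
		valid = (struct range){ u_end, r.end, 0 };
	return valid;			/* must be re-inserted and re-mapped */
}
```

In the example above, unmapping [0x5000, 0x9000) leaves [0x1000, 0x5000)
CPU-valid but GPU-unmapped, which is exactly the part the driver has to
rebuild on XNACK off hardware.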
>
>>
>> - drm_gpusvm range boundaries are determined by the fault address
>> and a preset chunk size, not by userspace attribute boundaries.
>> Ranges may be rechunked on memory changes. Embedding
>> attributes in framework ranges would scatter attribute state
>> across many small ranges and require complex reassembly
>> logic when operating on attributes.
>
> Yeah, that makes a lot of sense.
>
>>
>> 2) System memory mapping via drm_gpusvm
>>
>> The core mapping path uses drm_gpusvm_range_find_or_insert() to
>> create ranges, drm_gpusvm_range_get_pages() for HMM page fault
>> and DMA mapping, then updates GPU page tables via
>> amdgpu_vm_update_range().
>>
>> 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
>>
>> On XNACK off hardware the GPU cannot recover from page faults,
>> so mappings must be established through ioctl. When
>> userspace calls SET_ATTR with ACCESS=ENABLE, the driver
>> walks the attr tree and maps all accessible intervals
>> to the GPU via amdgpu_svm_range_map_attr_ranges().
>>
>> 4) Invalidation, GC worker, and restore worker
>>
>> MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
>> three cases based on event type and hardware mode:
>> - unmap event: clear GPU PTEs in the notifier context,
>> unmap DMA pages, mark ranges as unmapped, flush TLB,
>> and enqueue to the GC worker. On XNACK off, also
>> quiesce KFD queues and schedule rebuild of the
>> still valid portions that were destroyed together with
>> the unmapped subregion.
>>
>> - evict on XNACK off:
>> quiesce KFD queues first, then unmap DMA pages and
>> enqueue to the restore worker.
>
> Is that done through the DMA fence or by talking directly to the MES/HWS?
Currently the KFD queue quiesce/resume APIs are reused; we are looking
forward to a better solution.
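The intended ordering can be sketched roughly as follows (a hypothetical
user-space model; the svm_ctx struct and helper names are illustrative, not
the real kgd2kfd/amdgpu API):

```c
#include <assert.h>

/* Model of the evict path on XNACK-off hardware: queues must be quiesced
 * before DMA pages are unmapped, and may only resume after the restore
 * worker has re-established the mapping. */

enum queue_state { RUNNING, QUIESCED };

struct svm_ctx {
	enum queue_state queues;
	int dma_mapped;
};

static void evict(struct svm_ctx *c)
{
	c->queues = QUIESCED;		/* kgd2kfd_quiesce_mm() equivalent */
	assert(c->queues == QUIESCED);	/* never unmap under running queues */
	c->dma_mapped = 0;		/* unmap DMA pages, enqueue restore */
}

static void restore_worker(struct svm_ctx *c)
{
	c->dma_mapped = 1;		/* re-fault pages, rebuild GPU PTEs */
	c->queues = RUNNING;		/* kgd2kfd_resume_mm() equivalent */
}
```

The point of the ordering is that a GPU without recoverable page faults must
never run queues against an unmapped range, hence quiesce-first on evict and
resume-last on restore.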
Regards,
Honglei
>
> Thanks,
> Christian.
>
>>
>> - evict on XNACK on:
>> clear GPU PTEs, unmap DMA pages, and flush TLB, but do
>> not schedule any worker. The GPU will fault on next
>> access and the fault handler establishes the mapping.
>>
>> Features not yet supported:
>> - XNACK on GPU page fault mode
>> - migration and prefetch feature
>> - Multi GPU support
>>
>> XNACK on enablement is ongoing. The GPUs that support XNACK on
>> are currently only accessible to us via remote lab machines, which slows
>> down progress.
>>
>> Patch overview:
>>
>> 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
>> SET_ATTR/GET_ATTR operations, attribute types, and related
>> structs in amdgpu_drm.h.
>>
>> 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
>> refcount, attr_tree, workqueues, locks, and
>> callbacks (begin/end_restore, flush_tlb).
>>
>> 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
>> (interval tree node), attr_tree, access enum, flag masks,
>> and change trigger enum.
>>
>> 04/12 Attribute tree operations: interval tree lookup, insert,
>> remove, and tree create/destroy lifecycle.
>>
>> 05/12 Attribute set: validate UAPI attributes, apply to internal
>> attrs, handle hole/existing range with head/tail splitting,
>> compute change triggers, and -EAGAIN retry loop.
>> Implements attr_clear_pages for unmap cleanup and attr_get.
>>
>> 06/12 Range data structures: amdgpu_svm_range extending
>> drm_gpusvm_range with gpu_mapped state, pending ops,
>> pte_flags cache, and GC/restore queue linkage.
>>
>> 07/12 PTE flags and GPU mapping: simple gpu pte function,
>> GPU page table update with DMA address, range mapping loop:
>> find_or_insert -> get_pages -> validate -> update PTE,
>> and attribute change driven mapping function.
>>
>> 08/12 Notifier and invalidation: synchronous GPU PTE clear in
>> notifier context, range removal and overlap cleanup,
>> rebuild after destroy logic, and the MMU event dispatcher.
>>
>> 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
>> worker for unmap processing and rebuild, ordered restore
>> worker for mapping evicted ranges, and flush/sync
>> helpers.
>>
>> 10/12 Initialization and fini: kmem_cache for range/attr,
>> drm_gpusvm_init with chunk sizes, XNACK detection, TLB
>> flush helper, and amdgpu_svm init/close/fini lifecycle.
>>
>> 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
>> protection, amdgpu_gem_svm_ioctl dispatcher, and
>> amdgpu_svm_handle_fault for GPU page fault recovery.
>>
>> 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
>> Makefile rules, ioctl table registration, and amdgpu_vm
>> hooks (init in make_compute, close/fini, fault dispatch).
>>
>> Test results:
>> on gfx1100 (W7900) and gfx943 (MI300X)
>> kfd test: 95%+ passed, same failing cases as the official release
>> rocr test: all passed
>> hip catch test: 20 of 5366 cases failed, +13 failures vs the official release
>>
>> During implementation we identified several challenges / design questions:
>>
>> 1. No range splitting on partial unmap
>>
>> drm_gpusvm explicitly does not support range splitting (drm_gpusvm.c:122).
>> A partial munmap has to destroy the entire range, including the valid
>> interval. GPU fault driven hardware can handle this design with extra
>> GPU fault handling, but AMDGPU must also support XNACK off hardware,
>> where this design requires the driver to rebuild the valid part of the
>> removed range. This brings very heavy restore work to the work queue/GC
>> worker: unmap/destroy -> rebuild (insert and map). This restore work is
>> even heavier than in kfd_svm: previously the driver work queue only
>> needed to restore or unmap, but with drm_gpusvm the driver needs to
>> unmap and then restore, which brings more complex logic, a heavier
>> worker queue workload, and synchronization issues.
>>
>> 2. Fault driven vs ioctl driven mapping
>>
>> drm_gpusvm is designed around GPU page fault handlers. The primary entry
>> point drm_gpusvm_range_find_or_insert() takes a fault_addr.
>> AMDGPU needs to support ioctl driven mapping because on non-XNACK
>> hardware the GPU cannot fault at all.
>>
>> The ioctl path cannot hold mmap_read_lock across the entire operation
>> because drm_gpusvm_range_find_or_insert() acquires/releases it
>> internally. This creates race windows with MMU notifiers / workers.
>>
>> 3. Multi GPU support
>>
>> drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
>> each GPU gets an independent instance with its own range tree, MMU
>> notifiers, notifier_lock, and DMA mappings.
>>
>> This may bring huge overhead:
>> - N x MMU notifier registrations for the same address range
>> - N x hmm_range_fault() calls for the same page (KFD: 1x)
>> - N x DMA mapping memory
>> - N x invalidation + restore worker scheduling per CPU unmap event
>> - N x GPU page table flush / TLB invalidation
>> - Increased mmap_lock hold time, N callbacks serialize under it
>>
>> compatibility issues:
>> - Quiesce/resume scope mismatch: to integrate with KFD compute
>> queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm(),
>> which have process level semantics. Under the per GPU
>> drm_gpusvm model there may be synchronization issues. To properly
>> integrate with KFD under the per SVM model, a compatibility layer
>> or new per VM level queue control APIs may need to be introduced.
>>
>> Migration challenges:
>>
>> - No global migration decision logic: each per GPU SVM
>> instance maintains its own attribute tree independently. This
>> allows conflicting settings (e.g., GPU0's SVM sets
>> PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
>> for the same address range) with no detection or resolution.
>> A global attribute coordinator or a shared manager is needed to
>> provide a unified global view for migration decisions.
>>
>> - migrate_vma_setup broadcast: one GPU's migration triggers MMU
>> notifier callbacks in ALL N-1 other drm_gpusvm instances,
>> causing N-1 unnecessary restore workers to be scheduled and
>> creating races between the initiating migration and the other
>> instances' restore attempts.
>>
>> - No cross instance migration serialization: each per GPU
>> drm_gpusvm instance has independent locking, so two GPUs'
>> "decide -> migrate -> remap" sequences can interleave. While
>> the kernel page lock prevents truly simultaneous migration of
>> the same physical page, the losing side's retry (evict from
>> other GPU's VRAM -> migrate back) triggers broadcast notifier
>> invalidations and restore workers, compounding the ping pong
>> problem above.
>>
>> - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
>> hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
>> it only selects system memory pages for migration.
>>
>> - CPU fault reverse migration race: CPU page fault triggers
>> migrate_to_ram while GPU instances are concurrently operating.
>> Per GPU notifier_lock does not protect cross GPU operations.
>>
>> We believe a strong, well designed solution at the framework level is
>> needed to properly address these problems, and we look forward to
>> discussion and suggestions.
>>
>> Honglei Huang (12):
>> drm/amdgpu: add SVM UAPI definitions
>> drm/amdgpu: add SVM data structures and header
>> drm/amdgpu: add SVM attribute data structures
>> drm/amdgpu: implement SVM attribute tree operations
>> drm/amdgpu: implement SVM attribute set
>> drm/amdgpu: add SVM range data structures
>> drm/amdgpu: implement SVM range PTE flags and GPU mapping
>> drm/amdgpu: implement SVM range notifier and invalidation
>> drm/amdgpu: implement SVM range workers
>> drm/amdgpu: implement SVM core initialization and fini
>> drm/amdgpu: implement SVM ioctl and fault handler
>> drm/amdgpu: wire up SVM build system and fault handler
>>
>> drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
>> drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
>> include/uapi/drm/amdgpu_drm.h | 39 +
>> 12 files changed, 2958 insertions(+), 4 deletions(-)
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
>> create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
>>
>>
>> base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
>