From: Matthew Brost <matthew.brost@intel.com>
To: Honglei Huang <honghuan@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
Alexander.Deucher@amd.com, Felix.Kuehling@amd.com,
"Honglei Huang" <honglei1.huang@amd.com>,
Oak.Zeng@amd.com, Jenny-Jing.Liu@amd.com, Philip.Yang@amd.com,
Xiaogang.Chen@amd.com, Ray.Huang@amd.com, Lingshan.Zhu@amd.com,
Junhua.Shen@amd.com,
"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"Alice Ryhl" <aliceryhl@google.com>
Subject: Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
Date: Wed, 25 Mar 2026 15:24:47 -0700 [thread overview]
Message-ID: <acRgr7QwdULsn6G2@gsse-cloud1> (raw)
In-Reply-To: <26186168-abff-4ce4-ad93-db9bc2fd68d7@amd.com>
On Tue, Mar 24, 2026 at 03:24:43PM +0800, Honglei Huang wrote:
>
>
> On 3/23/26 14:31, Matthew Brost wrote:
> > On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
> > >
> > >
> > > On 3/19/26 13:08, Matthew Brost wrote:
> > > > On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
> > > > >
> > > >
> > > > Disclaimer: I haven't looked at any code in this series yet.
> > > >
> > > > >
> > > > > On 3/17/26 19:48, Christian König wrote:
> > > > > > Adding a few XE and drm_gpuvm people on TO.
> > > > > >
> > > > > > On 3/17/26 12:29, Honglei Huang wrote:
> > > > > > > From: Honglei Huang <honghuan@amd.com>
> > > > > > >
> > > > > > > This is a POC/draft patch series of SVM feature in amdgpu based on the
> > > > > > > drm_gpusvm framework. The primary purpose of this RFC is to validate
> > > > > > > the framework's applicability, identify implementation challenges,
> > > > > > > and start discussion on framework evolution. This is not a production
> > > >
> > > > +1. Open to any ideas. Given this was designed originally for Xe we very
> > > > well could have missed other drivers' requirements.
> > > Hi Matt,
> > >
> > > Thank you for the openness. And thank you so much for the incredibly
> > > detailed and patient response. I really appreciate you taking the time to
> > > walk through each point.
> > >
> >
> > I'm here to help.
> >
> > > Actually I am still a learner when it comes to the drm_gpusvm framework and
> > > GPU SVM design in general. Some of my descriptions below may not be entirely
> > > accurate. But I really want to bring drm_gpusvm into amdgpu and make it work
> > > well.
> >
> > I appreciate another driver jumping in and using this framework—it
> > becomes easier to validate as more users adopt it.
> >
> > >
> > > >
> > > > > > > ready submission.
> > > > > > >
> > > > > > > This patch series implements basic SVM support with the following features:
> > > > > > >
> > > > > > > 1. Attributes separated from physical page management:
> > > > > > >
> > > > > > > - Attribute layer (amdgpu_svm_attr_tree): a driver side interval
> > > > > > > tree that stores SVM attributes. Managed through the SET_ATTR,
> > > > > > > and mmu notifier callback.
> > > >
> > > > Can you explain the mmu notifier callback interaction here? See below in
> > > > Xe the attribute tree is existing VMA tree (gpuvm).
> > > >
> > >
> > > Let me try to explain, apologies if the description is not fully
> > > precise.
> > >
> > > In the current implementation, the MMU notifier callback interacts with the
> > > attr tree only in the munmap path: it removes the corresponding attribute
> > > entries from the attr tree so that stale attributes do not persist for
> > > freed address space.
> > >
> >
> > Ah, yes. We reset our attributes upon munmap too. We actually don't do
> > this 100% correctly either; a series is in flight to fix that [1].
> >
> > [1] https://patchwork.freedesktop.org/series/161815/
> >
>
> I studied [1]. This draft has a similar mechanism to handle attributes on
> munmap. But there are some slight differences in detail, maybe caused by
> different UMD runtime behaviors.
>
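As an aside, the munmap-path cleanup discussed above can be modeled in
userspace roughly like this. This is a toy sketch of my own, not the amdgpu
code: a real interval tree would also split an entry that straddles the
unmapped hole, while this version only trims or drops entries.

```c
/* Toy model of the notifier's munmap path against a driver-side
 * attribute interval set: entries overlapping the unmapped region are
 * trimmed or dropped so no stale attributes survive for freed address
 * space. Straddling entries are only head-trimmed here; the real code
 * would split them into head and tail.
 */
#include <assert.h>

struct attr_range {
	unsigned long start, end;	/* [start, end), 0/0 = empty slot */
};

/* Clamp every attribute entry against an unmapped region. */
void attrs_handle_munmap(struct attr_range *attrs, int n,
			 unsigned long um_start, unsigned long um_end)
{
	for (int i = 0; i < n; i++) {
		struct attr_range *r = &attrs[i];

		if (r->end <= um_start || r->start >= um_end)
			continue;		/* no overlap, keep as-is */
		if (r->start >= um_start && r->end <= um_end)
			r->start = r->end = 0;	/* fully covered: drop */
		else if (r->start < um_start)
			r->end = um_start;	/* keep the head */
		else
			r->start = um_end;	/* keep the tail */
	}
}
```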
>
> > > > > > >
> > > > > > > - Physical page layer (drm_gpusvm ranges): managed by the
> > > > > > > drm_gpusvm framework, representing actual HMM backed DMA
> > > > > > > mappings and GPU page table entries.
> > > > > > >
> > > > > > > This separation is necessary:
> > > > > > > - The framework does not support range splitting, so a partial
> > > > > > > munmap destroys the entire overlapping range, including the
> > > > > > > still valid parts. If attributes were stored inside drm_gpusvm
> > > > > > > ranges, they would be lost on unmapping.
> > > > > > > The separate attr tree preserves userspace set attributes
> > > > > > > across range operations.
> > > >
> > > > Yes, in Xe the divide is at the VMA level (set by user space) via VM
> > > > bind (parts of VM may be mappings BOs, parts could be setup for SVM) or
> > > > madvise IOCTLs which reflect user space attributes on current SVM
> > > > mappings or future ones.
> > > >
> > > > The SVM range tree reflects mappings that have been faulted into the
> > > > device and contain pages. This is an intentional choice.
> > >
> > > That makes a lot of sense. Thank you for clarifying the design intent. I
> > > think the current implementation adopts the same principle: the drm_gpusvm
> > > range tree only reflects actual faulted-in mappings.
> > >
> > > >
> > > > > >
> > > > > > Isn't that actually intended? When parts of the range unmap then that usually means the whole range isn't valid any more.
> > > >
> > > >
> > > > Yes, this was an intentional design choice to not support partial unmap,
> > > > and instead rely on the driver to recreate a new range.
> > > >
> > > > The reasoning is:
> > > >
> > > > - In practice, this should be rare for well-behaved applications.
> > > >
> > > > - With THP / large device pages, if a sub-range is unmapped, the entire
> > > > GPU mapping is invalidated anyway due to the page size change. As a
> > > > result, the cost of creating a new range is minimal, since the device
> > > > will likely fault again on the remaining pages.
> > > >
> > > > So there is no need to over-engineer the common code.
> > > >
> > > > FWIW, to even test partial unmaps in Xe, I had to do things I doubt
> > > > anyone would ever do:
> > > >
> > > > ptr = mmap(SZ_2M);
> > > > /* fault in memory to the device */
> > > > munmap(ptr, SZ_1M);
> > > > /* touch memory again on the device */
> > > >
> > >
> > > Thank you for this explanation and the concrete example. After further
> > > discussion internally with Christian, we are now aligned on the same
> > > position on partial unmap. We will remove the rebuild-on-partial-unmap
> > > logic in the next version and handle it as an only-partially-backed range.
> > >
> > > > >
> > > > >
> > > > > It is about partial unmap: some subregion in a drm_gpusvm_range is still
> > > > > valid while another subregion is invalid, but under drm_gpusvm we need
> > > > > to destroy the entire range.
> > > > >
> > > > > e.g.:
> > > > >
> > > > > [---------------unmap region in mmu notifier-----------------]
> > > > > [0x1000 ------------ 0x9000]
> > > > > [ valid ][ invalid ]
> > > > >
> > > > > see detail in drm_gpusvm.c line 110,
> > > > > section: Partial Unmapping of Ranges
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > - drm_gpusvm range boundaries are determined by fault address
> > > > > > > and preset chunk size, not by userspace attribute boundaries.
> > > > > > > Ranges may be rechunked on memory changes. Embedding
> > > > > > > attributes in framework ranges would scatter attr state
> > > > > > > across many small ranges and require complex reassembly
> > > > > > > logic when operating on attributes.
> > > > > >
> > > > > > Yeah, that makes a lot of sense.
> > > > > >
> > > > > > >
> > > > > > > 2) System memory mapping via drm_gpusvm
> > > > > > >
> > > > > > > The core mapping path uses drm_gpusvm_range_find_or_insert() to
> > > > > > > create ranges, drm_gpusvm_range_get_pages() for HMM page fault
> > > > > > > and DMA mapping, then updates GPU page tables via
> > > > > > > amdgpu_vm_update_range().
> > > > > > >
> > > > > > > 3) IOCTL driven mapping (XNACK off / no GPU fault mode)
> > > > > > >
> > > > > > > On XNACK off hardware the GPU cannot recover from page faults,
> > > > > > > so mappings must be established through ioctl. When
> > > > > > > userspace calls SET_ATTR with ACCESS=ENABLE, the driver
> > > > > > > walks the attr tree and maps all accessible intervals
> > > > > > > to the GPU by amdgpu_svm_range_map_attr_ranges().
> > > >
> > > > Can you expand on XNACK off / GPU no faults? Is this to share the GPU
> > > > between 3D (dma-fences) and faulting clients? We have something similar
> > > > in Xe, but it isn't an explicit IOCTL; rather we switch on demand as a
> > > > 3D client submits and then resume page faults when all dma-fences
> > > > have signaled.
> > > >
> > > > I see below you mention page tables are modified while quiescing KFD
> > > > queues? I'm not sure that is required - you just need to guarantee
> > > > faulting clients won't trigger page faults while a dma-fence is in flight.
> > > >
> > > > Maybe give me an explanation of exactly what the requirements from AMD
> > > > are here so I have a better picture.
> > >
> > > Thank you for the patience, let me try to explain our situation, though
> > > I may not get every detail right.
> > >
> > > XNACK off means hardware that does not have GPU page fault capability (or
> > > has it turned off).
> > >
> > > So for these GPUs, ALL page table entries must be fully populated before
> > > the GPU can access the memory. This is why we need the ioctl-driven
> > > mapping path: when userspace calls SET_ATTR with ACCESS=ENABLE, we need
> > > to walk the attribute tree and eagerly map all accessible ranges into the
> > > GPU page tables. This is functionally similar to what you describe as
> > > prefetch IOCTLs / VM bind in Xe.
> > >
> > > Regarding queue quiesce during page table modification: on XNACK off
> > > hardware, because the GPU cannot fault, we must ensure the GPU is
> > > completely stopped before modifying any PTE it might be accessing.
> > > Otherwise the GPU could access a partially updated page table and hang.
> > > The quiesce/resume is the mechanism to guarantee this.
> > >
> > > I hope that helps clarify the picture.
> > >
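Restating that contract as code, since it is the crux of the XNACK-off
design: PTE updates are only legal while the queues are stopped, and the
SET_ATTR(ACCESS=ENABLE) path eagerly maps every accessible interval under
that bracket. All names here are stand-ins of mine, not the amdgpu API:

```c
/* Toy state machine for the XNACK-off contract: the GPU cannot fault,
 * so page tables may only be modified while queues are quiesced, and
 * enabling access eagerly maps everything before queues resume. */
#include <assert.h>
#include <stdbool.h>

struct xnack_off_vm {
	bool queues_running;
	int mapped_ranges;
};

static void quiesce_queues(struct xnack_off_vm *vm) { vm->queues_running = false; }
static void resume_queues(struct xnack_off_vm *vm)  { vm->queues_running = true; }

/* Map one accessible interval. A running queue could observe a partial
 * page-table update and hang, so refuse unless quiesced. */
static int map_interval(struct xnack_off_vm *vm)
{
	if (vm->queues_running)
		return -1;		/* would race with the GPU */
	vm->mapped_ranges++;
	return 0;
}

/* SET_ATTR(ACCESS=ENABLE) path: walk all accessible intervals and map
 * them eagerly, bracketed by quiesce/resume. */
int set_attr_access_enable(struct xnack_off_vm *vm, int nr_intervals)
{
	quiesce_queues(vm);
	for (int i = 0; i < nr_intervals; i++)
		if (map_interval(vm))
			return -1;
	resume_queues(vm);
	return 0;
}
```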
> >
> > This clarifies a lot. This is what we’d call in Xe “preemption fence”
> > mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
> > resume. We don’t actually support SVM in this case; instead, we use
> > “userptr binds,” which are built on gpusvm for page collection. However,
> > we don’t support migrating memory to the device—though we could.
> >
> > I'd look at how we converted 'userptr' to be based on GPU SVM [2]. In
> > this case, we don't maintain a range tree, as those - as you suggest - are
> > more of an on-demand fault driver concern. Instead, we just embed 'struct
> > drm_gpusvm_pages' in the VMA struct defined by the IOCTLs.
> >
> > We could extend this to support migrating 'userptr', but we just haven't
> > done that yet - this may be what you want to do in "XNACK off".
> >
> > [2] https://patchwork.freedesktop.org/series/146553/
> >
>
> Actually we need to switch the xnack mode between on and off, so in xnack
> off mode the driver operates in "implicit prefetch mode". This may be due
> to compatibility with older hardware and the needs of the UMD runtime. We
> will further discuss the handling method under xnack off internally.
>
>
> > >
> > > >
> > > > > > >
> > > > > > > 4) Invalidation, GC worker, and restore worker
> > > > > > >
> > > > > > > MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
> > > > > > > three cases based on event type and hardware mode:
> > > > > > > - unmap event: clear GPU PTEs in the notifier context,
> > > > > > > unmap DMA pages, mark ranges as unmapped, flush TLB,
> > > > > > > and enqueue to the GC worker. On XNACK off, also
> > > > > > > quiesce KFD queues and schedule rebuild of the
> > > > > > > still valid portions that were destroyed together with
> > > > > > > the unmapped subregion.
> > > > > > >
> > > > > > > - evict on XNACK off:
> > > > > > > quiesce KFD queues first, then unmap DMA pages and
> > > > > > > enqueue to the restore worker.
> > > > > >
> > > > > > Is that done through the DMA fence or by talking directly to the MES/HWS?
> > > > >
> > > > > Currently the KFD queue quiesce/resume APIs are reused; looking forward
> > > > > to a better solution.
> > > > >
> > > >
> > > > +1
> > > >
> > > > > Regards,
> > > > > Honglei
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Christian.
> > > > > >
> > > > > > >
> > > > > > > - evict on XNACK on:
> > > > > > > clear GPU PTEs, unmap DMA pages, and flush TLB, but do
> > > > > > > not schedule any worker. The GPU will fault on next
> > > > > > > access and the fault handler establishes the mapping.
> > > > > > >
> > > > > > > Not supported feature:
> > > > > > > - XNACK on GPU page fault mode
> > > > > > > - migration and prefetch feature
> > > > > > > - Multi GPU support
> > > > > > >
> > > > > > > XNACK on enablement is ongoing. The GPUs that support XNACK on
> > > > > > > are currently only accessible to us via remote lab machines, which
> > > > > > > slows down progress.
> > > > > > >
> > > > > > > Patch overview:
> > > > > > >
> > > > > > > 01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
> > > > > > > SET_ATTR/GET_ATTR operations, attribute types, and related
> > > > > > > structs in amdgpu_drm.h.
> > > > > > >
> > > > > > > 02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
> > > > > > > refcount, attr_tree, workqueues, locks, and
> > > > > > > callbacks (begin/end_restore, flush_tlb).
> > > > > > >
> > > > > > > 03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
> > > > > > > (interval tree node), attr_tree, access enum, flag masks,
> > > > > > > and change trigger enum.
> > > > > > >
> > > > > > > 04/12 Attribute tree operations: interval tree lookup, insert,
> > > > > > > remove, and tree create/destroy lifecycle.
> > > > > > >
> > > > > > > 05/12 Attribute set: validate UAPI attributes, apply to internal
> > > > > > > attrs, handle hole/existing range with head/tail splitting,
> > > > > > > compute change triggers, and -EAGAIN retry loop.
> > > > > > > Implements attr_clear_pages for unmap cleanup and attr_get.
> > > > > > >
> > > > > > > 06/12 Range data structures: amdgpu_svm_range extending
> > > > > > > drm_gpusvm_range with gpu_mapped state, pending ops,
> > > > > > > pte_flags cache, and GC/restore queue linkage.
> > > > > > >
> > > > > > > 07/12 PTE flags and GPU mapping: simple gpu pte function,
> > > > > > > GPU page table update with DMA address, range mapping loop:
> > > > > > > find_or_insert -> get_pages -> validate -> update PTE,
> > > > > > > and attribute change driven mapping function.
> > > > > > >
> > > > > > > 08/12 Notifier and invalidation: synchronous GPU PTE clear in
> > > > > > > notifier context, range removal and overlap cleanup,
> > > > > > > rebuild after destroy logic, and MMU event dispatcher
> > > > > > >
> > > > > > > 09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
> > > > > > > worker for unmap processing and rebuild, ordered restore
> > > > > > > worker for mapping evicted ranges, and flush/sync
> > > > > > > helpers.
> > > > > > >
> > > > > > > 10/12 Initialization and fini: kmem_cache for range/attr,
> > > > > > > drm_gpusvm_init with chunk sizes, XNACK detection, TLB
> > > > > > > flush helper, and amdgpu_svm init/close/fini lifecycle.
> > > > > > >
> > > > > > > 11/12 IOCTL and fault handler: PASID based SVM lookup with kref
> > > > > > > protection, amdgpu_gem_svm_ioctl dispatcher, and
> > > > > > > amdgpu_svm_handle_fault for GPU page fault recovery.
> > > > > > >
> > > > > > > 12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
> > > > > > > Makefile rules, ioctl table registration, and amdgpu_vm
> > > > > > > hooks (init in make_compute, close/fini, fault dispatch).
> > > > > > >
> > > > > > > Test result:
> > > > > > > on gfx1100(W7900) and gfx943(MI300x)
> > > > > > > kfd test: 95%+ passed, same failed cases as the official release
> > > > > > > rocr test: all passed
> > > > > > > hip catch test: 20 cases failed out of 5366 total, +13 failures vs the official release
> > > > > > >
> > > > > > > During implementation we identified several challenges / design questions:
> > > > > > >
> > > > > > > 1. No range splitting on partial unmap
> > > > > > >
> > > > > > > drm_gpusvm explicitly does not support range splitting in drm_gpusvm.c:122.
> > > > > > > Partial munmap needs to destroy the entire range including the valid interval.
> > > > > > > GPU-fault-driven hardware can handle this design with an extra GPU fault
> > > > > > > handler, but AMDGPU needs to support XNACK off hardware, where this design
> > > > > > > requires the driver to rebuild the valid part of the removed range. This
> > > > > > > brings very heavy restore work in the work queue/GC worker: unmap/destroy
> > > > > > > -> rebuild (insert and map). This restore work is even heavier than in
> > > > > > > kfd_svm: previously the driver work queue only needed to restore or unmap,
> > > > > > > but with drm_gpusvm the driver needs to unmap and restore, which brings
> > > > > > > more complex logic, a heavier worker queue workload, and synchronization
> > > > > > > issues.
> > > >
> > > > Is this common in the workload you are running? I'm also wondering if
> > > > your restore logic / KFD's design is actually contributing to the
> > > > problem.
> > > >
> > >
> > > Honestly, you raise a fair point.
> > >
> > > We will redesign the logic around partial munmap, which should eliminate
> > > most of this complexity.
> > >
> > >
> >
> > +1, yes, test for it but do not optimize for it.
> >
> > > > > > >
> > > > > > > 2. Fault driven vs ioctl driven mapping
> > > > > > >
> > > > > > > drm_gpusvm is designed around GPU page fault handlers. The primary entry
> > > > > > > point drm_gpusvm_range_find_or_insert() takes a fault_addr.
> > > > > > > AMDGPU needs to support IOCTL-driven mapping because no-XNACK hardware
> > > > > > > cannot fault at all.
> > > >
> > > > I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
> > > > issues these so the device does not fault (e.g., prefetch creates a set
> > > > of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
> > > > specific VM bind operations.
> > > >
> > >
> > > That is a very helpful way to think about it. Yes, our ioctl-driven
> > > mapping (XNACK off) is essentially equivalent to a prefetch operation. We
> > > are trying to improve it.
> > >
> >
> > See above wrt 'userptr'.
>
> Got it.
>
> >
> > >
> > > > > > >
> > > > > > > The ioctl path cannot hold mmap_read_lock across the entire operation
> > > > > > > because drm_gpusvm_range_find_or_insert() acquires/releases it
> > > > > > > internally. This creates race windows with MMU notifiers / workers.
> > > >
> > > > This is a very intentional choice in the locking design: mmap_read_lock
> > > > is held only in very specific parts of GPU SVM, and the driver should
> > > > never need to take this lock.
> > > >
> > > > Yes, notifiers can race, which is why the GPU fault handler and prefetch
> > > > handler are structured as retry loops when a notifier race is detected.
> > > > In practice, with well-behaved applications, these races should be
> > > > rare—but they do occur, and the driver must handle them.
> > > >
> > > > __xe_svm_handle_pagefault implements the page fault retry loop. VM bind
> > > > prefetch has similar logic, although it is more spread out given that it
> > > > is part of a deeper software pipeline.
> > > >
> > > > FWIW, holding locks to avoid races was rejected by Sima because we
> > > > reasoned it is essentially impossible to guarantee the absence of races
> > > > by holding a lock. CPU page fault handlers are also effectively just
> > > > large retry loops.
> > > >
> > > > So this is one point I believe you will need to fixup driver side.
> > > >
> > >
> > > Understood. Thank you for the detailed explanation and for pointing to
> > > __xe_svm_handle_pagefault as a reference. We will restructure both our
> > > fault handler and ioctl path into a better retry loop pattern with
> > > sequence number race detection.
> > >
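The retry pattern being discussed can be reduced to a small single-threaded
model. This is my own toy seqcount, not the real API: a real driver uses
mmu_interval_read_begin()/mmu_interval_read_retry() under its notifier lock,
with hmm_range_fault() doing the page collection:

```c
/* Simplified model of a fault-handler retry loop: sample a notifier
 * sequence number, collect pages, and bind only if no invalidation
 * raced in between; otherwise loop and try again. */
#include <assert.h>

struct toy_notifier {
	unsigned long seq;	/* bumped by (simulated) invalidations */
};

static unsigned long read_begin(struct toy_notifier *n) { return n->seq; }

static int read_retry(struct toy_notifier *n, unsigned long seq)
{
	return n->seq != seq;	/* raced with an invalidation */
}

/* Fault handler: 'raced_once' simulates one concurrent invalidation.
 * Returns the number of attempts it took to bind cleanly. */
int handle_fault(struct toy_notifier *n, int *raced_once)
{
	int attempts = 0;

	for (;;) {
		unsigned long seq = read_begin(n);

		attempts++;
		/* ... hmm_range_fault() + dma-map would go here ... */
		if (*raced_once) {	/* simulate a concurrent unmap */
			n->seq++;
			*raced_once = 0;
		}
		/* bind only if no invalidation raced with us */
		if (!read_retry(n, seq))
			return attempts;
	}
}
```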
> >
> > Yes, the typical pattern is:
> >
> > - Try to migrate once
> > - If you hit a race, give up, evict all memory back to system memory, and bind it
> >
> > Atomics make this tricky because memory must move, but I’m not sure
> > “XNACK off” applies here. However, GPU SVM provides a timeslice
> > mechanism to ensure the CPU can’t move memory while the GPU needs to
> > execute something.
>
> Understood.
>
> >
> > > > > > >
> > > > > > > 3. Multi GPU support
> > > > > > >
> > > > > > > drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
> > > > > > > each GPU gets an independent instance with its own range tree, MMU
> > > > > > > notifiers, notifier_lock, and DMA mappings.
> > > > > > >
> > > >
> > > > This is a part I am absolutely open to fixing. Right now, each
> > > > drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
> > > > decoupling a GPU SVM instance from a single device, allowing each
> > > > drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
> > > > device).
> > > >
> > > > This would give drivers the flexibility to use one GPU SVM instance per
> > > > VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
> > > > MM.
> > > >
> > >
> > > That would be wonderful! Looking forward to your patch very much!
> > >
> >
> > I can't say I'll code this, but we have thought about it as an option and
> > are very open to patches which refactor the object model for multiple use
> > cases.
>
> Understood. I will focus on single GPU first, and once we have a
> solid v1, we'd be happy to explore contributing patches for the
> multi-device object model refactoring.
>
I think roughly what would need to be done is:
- Move struct drm_gpusvm_pages out of struct drm_gpusvm_range.
- Embed either a struct device or a struct drm_device in struct
drm_gpusvm_pages.
- Drop struct drm_device from struct drm_gpusvm.
- Have the driver’s range structure embed one or more struct
drm_gpusvm_pages in addition to struct drm_gpusvm_range.
- Refactor a few range-based helpers (drm_gpusvm_range_pages_valid,
drm_gpusvm_range_get_pages, drm_gpusvm_range_unmap_pages), or simply
drop them entirely and update drivers to use the drm_gpusvm_pages
helpers instead.
Then it is up to the drivers whether struct drm_gpusvm maps to a single
device or multiple devices. Either use case seems valid, and giving
drivers the option appears to be the right approach, rather than having
the common drm_gpusvm layer impose its own constraints.
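To make the shape of that refactor concrete, here is a rough sketch with
made-up stand-in types (these are not the actual drm_gpusvm structures):
drm_gpusvm_pages gains its own device pointer, drm_gpusvm loses its
drm_device, and the driver's range embeds one pages set per device.

```c
/* Stand-in object model for the proposed multi-device refactor. */
#include <assert.h>
#include <stddef.h>

struct toy_device { int id; };

struct toy_gpusvm_pages {
	struct toy_device *dev;		/* per-device DMA mappings live here */
	void *dma_addrs;
};

struct toy_gpusvm {			/* note: no device pointer anymore */
	unsigned long mm_start, mm_range;
};

struct toy_gpusvm_range {
	unsigned long start, end;
};

#define TOY_MAX_DEVS 2

/* Driver-side range: common range state plus one pages set per device. */
struct toy_driver_range {
	struct toy_gpusvm_range base;
	struct toy_gpusvm_pages pages[TOY_MAX_DEVS];
};

int range_init(struct toy_driver_range *r, struct toy_device *devs, int n)
{
	if (n > TOY_MAX_DEVS)
		return -1;
	for (int i = 0; i < n; i++) {
		r->pages[i].dev = &devs[i];
		r->pages[i].dma_addrs = NULL;	/* mapped on demand per device */
	}
	return 0;
}
```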
This type of refactor can be done at any time as an independent patch,
so feel free to post it whenever and I can verify on the Xe side that
everything looks good.
Matt
> >
> > >
> > > > > > > This may bring huge overhead:
> > > > > > > - N x MMU notifier registrations for the same address range
> > > >
> > > > The notifier overhead is a real concern. We recently introduced two-pass
> > > > notifiers [1] to speed up multi-device notifiers. At least in Xe, the
> > > > TLB invalidations—which are the truly expensive part—can be pipelined
> > > > using the two-pass approach. Currently, [1] only implements two-pass
> > > > notifiers for userptr, but Xe’s GPU SVM will be updated to use them
> > > > shortly.
> > > >
> > > > [1] https://patchwork.freedesktop.org/series/153280/
> > > >
> > >
> > > Thank you for the pointer to two-pass notifiers. Will study this
> > > series.
> > >
> > > > > > > - N x hmm_range_fault() calls for the same page (KFD: 1x)
> > > >
> > > > hmm_range_fault is extremely fast compared to the actual migration.
> > > > Running hmm_range_fault on a 2MB region using 4KB pages takes less
> > > > than 1µs. With THP or large device pages [2] (merged last week), it’s
> > > > around 1/20 of a microsecond. So I wouldn’t be too concerned about this.
> > > >
> > > > [2] https://patchwork.freedesktop.org/series/163141/
> > > >
> > >
> > > That is very helpful data. Perhaps I worried too much.
> > >
> > > > > > > - N x DMA mapping memory
> > > >
> > > > You will always have N x DMA mapping memory if the pages are in system
> > > > memory as the dma-mapping API is per device.
> > >
> > > Totally agreed.
> > >
> > > >
> > > > > > > - N x invalidation + restore worker scheduling per CPU unmap event
> > > > > > > - N x GPU page table flush / TLB invalidation
> > > >
> > > > I agree you do not want to serialize GPU page table flushes / TLB
> > > > invalidations. Hence two-pass notifiers [1].
> > >
> > > Yes, will learn it.
> > >
> > > >
> > > > > > > - Increased mmap_lock hold time, N callbacks serialize under it
> > > > > > >
> > > > > > > compatibility issues:
> > > > > > > - Quiesce/resume scope mismatch: to integrate with KFD compute
> > > > > > > queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm(),
> > > > > > > which have process level semantics. Under the per GPU
> > > > > > > drm_gpusvm model there may be synchronization issues. To properly
> > > > > > > integrate with KFD under the per SVM model, compatibility or
> > > > > > > new per VM level queue control APIs may need to be introduced.
> > > > > > >
> > > >
> > > > I thought the idea was to get rid of KFD and move over to AMDGPU? I
> > > > thought Christian mentioned this to me at XDC.
> > > >
> > >
> > > > > > > Migration challenges:
> > > > > > >
> > > > > > > - No global migration decision logic: each per GPU SVM
> > > > > > > instance maintains its own attribute tree independently. This
> > > > > > > allows conflicting settings (e.g., GPU0's SVM sets
> > > > > > > PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
> > > > > > > for the same address range) with no detection or resolution.
> > > > > > > A global attribute coordinator or a shared manager is needed to
> > > > > > > provide a unified global view for migration decisions
> > > >
> > > > Yes, this is a hole in the Xe API too. We have told UMDs that if they set
> > > > up individual VMs with conflicting attributes for a single CPU address
> > > > space, the behavior is undefined. Our UMD's madvise implementation is
> > > > basically a loop over all GPU VMs setting the same attributes.
> > >
> > > Will follow the same approach for now, the UMD is responsible for setting
> > > consistent attributes across GPU VMs.
> > >
> >
> > +1
> >
> > > >
> > > > > > >
> > > > > > > - migrate_vma_setup broadcast: one GPU's migration triggers MMU
> > > > > > > notifier callbacks in ALL N-1 other drm_gpusvm instances,
> > > > > > > causing N-1 unnecessary restore workers to be scheduled. And
> > > >
> > > > My feeling is that you shouldn’t reschedule restore workers unless you
> > > > actually have to invalidate page tables (i.e., you have a local SVM
> > > > range within the notifier). So the first migration to an untouched
> > > > region may trigger notifiers, but they won’t do anything because you
> > > > don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
> > > > region won’t trigger a notifier unless the memory is moved again.
> > > >
> > >
> > > That is a very good point. We should check whether we actually have
> > > valid SVM ranges before scheduling restore workers. If there is nothing
> > > to invalidate, the notifier callback should be a no-op. We will review
> > > our notifier callback logic to ensure we are not doing unnecessary work
> > > here. Thank you for pointing this out.
> > >
> > > > > > > creates races between the initiating migration and the other
> > > > > > > instance's restore attempts.
> > > >
> > > > Yes, if multiple devices try to migrate the same CPU pages at the same
> > > > time, that will race. That’s why in Xe we have a module-level
> > > > driver_migrate_lock. The first migration runs in read mode; if it
> > > > detects a race and aborts, it then takes driver_migrate_lock in write
> > > > mode so it becomes the only device allowed to move memory / CPU pages.
> > > > See xe_svm_alloc_vram() for how this is used.
> > > >
> > > > I’m not sure this approach will work for you, but I just wanted to point
> > > > out that we identified this as a potential issue.
> > > >
> > >
> > > Thank you for sharing the driver_migrate_lock approach and pointing to
> > > xe_svm_alloc_vram(). Will explore whether a similar lock pattern can work
> > > for our case.
> > >
> > > > > > >
> > > > > > > - No cross instance migration serialization: each per GPU
> > > > > > > drm_gpusvm instance has independent locking, so two GPUs'
> > > > > > > "decide -> migrate -> remap" sequences can interleave. While
> > > > > > > the kernel page lock prevents truly simultaneous migration of
> > > > > > > the same physical page, the losing side's retry (evict from
> > > > > > > other GPU's VRAM -> migrate back) triggers broadcast notifier
> > > > > > > invalidations and restore workers, compounding the ping pong
> > > > > > > problem above.
> > > > > > >
> > > >
> > > > See the driver_migrate_lock above.
> > >
> > > Acknowledged, thank you.
> > > >
> > > > > > > - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
> > > > > > > hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
> > > > > > > it only selects system memory pages for migration.
> > > > > > >
> > > >
> > > > I think this is fixed? We did find some core MM bugs that blocked VRAM
> > > > to VRAM but those have been worked out.
> > > >
> > > > The code I'm looking at:
> > > >
> > > > 517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
> > > > 518 struct mm_struct *mm,
> > > > 519 unsigned long start, unsigned long end,
> > > > 520 const struct drm_pagemap_migrate_details *mdetails)
> > > > 521 {
> > > > 522 const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
> > > > 523 struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
> > > > 524 struct dev_pagemap *pagemap = dpagemap->pagemap;
> > > > 525 struct migrate_vma migrate = {
> > > > 526 .start = start,
> > > > 527 .end = end,
> > > > 528 .pgmap_owner = pagemap->owner,
> > > > 529 .flags = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
> > > > 530 MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
> > > > 531 };
> > > >
> > >
> > > Thank you for checking! I am using v6.18 for this POC and missed the
> > > fix; will rebase to the latest.
> > >
> > >
> > > > > > > - CPU fault reverse migration race: CPU page fault triggers
> > > > > > > migrate_to_ram while GPU instances are concurrently operating.
> > > > > > > Per GPU notifier_lock does not protect cross GPU operations.
> > > >
> > > > No, again retry loop as discussed above.
> > >
> > > Understood.
> > >
> > > >
> > > > > > >
> > > > > > > We believe a strong, well designed solution at the framework level is
> > > > > > > needed to properly address these problems, and we look forward to
> > > > > > > discussion and suggestions.
> > > >
> > > > Let's work together to figure out what is missing here.
> > >
> > > Thank you so much, Matt. Your feedback has been incredibly valuable and
> > > has given us a much clearer picture of the framework's design.
> > > I really appreciate the effort you put into building drm_gpusvm as a
> > > shared framework. We will incorporate your suggestions into our next
> > > revision and look forward to continuing the collaboration.
> > >
> >
> > No problem. Happy to help.
>
> Thank you again for all the detailed feedback.
>
> Regards,
> Honglei
>
> >
> > Matt
> >
> > > Regards,
> > > Honglei
> > >
> > >
> > > >
> > > > Matt
> > > >
> > > > > > >
> > > > > > > Honglei Huang (12):
> > > > > > > drm/amdgpu: add SVM UAPI definitions
> > > > > > > drm/amdgpu: add SVM data structures and header
> > > > > > > drm/amdgpu: add SVM attribute data structures
> > > > > > > drm/amdgpu: implement SVM attribute tree operations
> > > > > > > drm/amdgpu: implement SVM attribute set
> > > > > > > drm/amdgpu: add SVM range data structures
> > > > > > > drm/amdgpu: implement SVM range PTE flags and GPU mapping
> > > > > > > drm/amdgpu: implement SVM range notifier and invalidation
> > > > > > > drm/amdgpu: implement SVM range workers
> > > > > > > drm/amdgpu: implement SVM core initialization and fini
> > > > > > > drm/amdgpu: implement SVM ioctl and fault handler
> > > > > > > drm/amdgpu: wire up SVM build system and fault handler
> > > > > > >
> > > > > > > drivers/gpu/drm/amd/amdgpu/Kconfig | 11 +
> > > > > > > drivers/gpu/drm/amd/amdgpu/Makefile | 13 +
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c | 430 ++++++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h | 147 ++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c | 894 ++++++++++++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h | 110 ++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h | 76 ++
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 40 +-
> > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 4 +
> > > > > > > include/uapi/drm/amdgpu_drm.h | 39 +
> > > > > > > 12 files changed, 2958 insertions(+), 4 deletions(-)
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
> > > > > > > create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
> > > > > > >
> > > > > > >
> > > > > > > base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
> > > > > >
> > > > >
> > >
>
Thread overview: 36+ messages
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 01/12] drm/amdgpu: add SVM UAPI definitions Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 02/12] drm/amdgpu: add SVM data structures and header Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 03/12] drm/amdgpu: add SVM attribute data structures Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 04/12] drm/amdgpu: implement SVM attribute tree operations Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 05/12] drm/amdgpu: implement SVM attribute set Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 06/12] drm/amdgpu: add SVM range data structures Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 07/12] drm/amdgpu: implement SVM range PTE flags and GPU mapping Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 08/12] drm/amdgpu: implement SVM range notifier and invalidation Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 09/12] drm/amdgpu: implement SVM range workers Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 10/12] drm/amdgpu: implement SVM core initialization and fini Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 11/12] drm/amdgpu: implement SVM ioctl and fault handler Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 12/12] drm/amdgpu: wire up SVM build system " Honglei Huang
2026-03-17 11:48 ` [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Christian König
2026-03-18 8:59 ` Honglei Huang
2026-03-19 5:08 ` Matthew Brost
2026-03-19 14:17 ` Honglei Huang
2026-03-23 6:31 ` Matthew Brost
2026-03-24 7:24 ` Honglei Huang
2026-03-25 22:24 ` Matthew Brost [this message]
2026-03-26 12:16 ` Honglei Huang
2026-04-15 10:04 ` Huang, Honglei1
2026-04-23 6:40 ` Matthew Brost
2026-04-23 7:18 ` Matthew Brost
2026-04-23 11:03 ` Huang, Honglei1
2026-04-23 20:21 ` Matthew Brost
2026-04-24 10:43 ` Huang, Honglei1
2026-04-27 20:00 ` Felix Kuehling
2026-04-28 2:23 ` Huang, Honglei1
2026-04-30 3:04 ` Matthew Brost
2026-04-23 6:09 ` Huang, Honglei1
2026-04-23 6:52 ` Matthew Brost
2026-04-23 8:22 ` Huang, Honglei1
2026-04-29 9:56 ` Huang, Honglei1
2026-04-30 2:56 ` Huang, Honglei1
2026-04-30 3:12 ` Matthew Brost