AMD-GFX Archive on lore.kernel.org
From: "Huang, Honglei1" <honghuan@amd.com>
To: "Matthew Brost" <matthew.brost@intel.com>,
	"Christian König" <christian.koenig@amd.com>,
	Felix.Kuehling@amd.com, Philip.Yang@amd.com
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	Alexander.Deucher@amd.com,
	"Honglei Huang" <honglei1.huang@amd.com>,
	Oak.Zeng@amd.com, Jenny-Jing.Liu@amd.com, Xiaogang.Chen@amd.com,
	Ray.Huang@amd.com, Lingshan.Zhu@amd.com, Junhua.Shen@amd.com,
	"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
	"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"Alice Ryhl" <aliceryhl@google.com>
Subject: Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
Date: Thu, 23 Apr 2026 14:09:59 +0800	[thread overview]
Message-ID: <a9d979e7-6ff1-47c5-87c5-b905e5c84ded@amd.com> (raw)
In-Reply-To: <acDeRhCTh/ehOUyu@lstrano-desk.jf.intel.com>



On 3/23/2026 2:31 PM, Matthew Brost wrote:
> On Thu, Mar 19, 2026 at 10:17:36PM +0800, Honglei Huang wrote:
>>
>>
>> On 3/19/26 13:08, Matthew Brost wrote:
>>> On Wed, Mar 18, 2026 at 04:59:31PM +0800, Honglei Huang wrote:
>>>>
>>>
>>> Disclaimer: I haven't looked at any code in this series yet.
>>>
>>>>
>>>> On 3/17/26 19:48, Christian König wrote:
>>>>> Adding a few XE and drm_gpuvm people on TO.
>>>>>
>>>>> On 3/17/26 12:29, Honglei Huang wrote:
>>>>>> From: Honglei Huang <honghuan@amd.com>
>>>>>>
>>>>>> This is a POC/draft patch series of SVM feature in amdgpu based on the
>>>>>> drm_gpusvm framework. The primary purpose of this RFC is to validate
>>>>>> the framework's applicability, identify implementation challenges,
>>>>>> and start discussion on framework evolution. This is not a production
>>>
>>> +1. Open to any ideas. Given this was designed originally for Xe, we very
>>> well could have missed other drivers' requirements.
>> Hi Matt,
>>
>> Thank you for the openness. And thank you so much for the incredibly
>> detailed and patient response. I really appreciate you taking the time to
>> walk through each point.
>>
> 
> I'm here to help.
> 
>> Actually I am still a learner when it comes to the drm_gpusvm framework and
>> GPU SVM design in general. Some of my descriptions below may not be entirely
>> accurate. But I really want to bring drm_gpusvm into amdgpu and make it work
>> well.
> 
> I appreciate another driver jumping in and using this framework—it
> becomes easier to validate as more users adopt it.
> 
>>
>>>
>>>>>> ready submission.
>>>>>>
>>>>>> This patch series implements basic SVM support with the following features:
>>>>>>
>>>>>>      1) attributes separated from physical page management:
>>>>>>
>>>>>>        - Attribute layer (amdgpu_svm_attr_tree): a driver-side interval
>>>>>>          tree that stores SVM attributes, managed through the SET_ATTR
>>>>>>          ioctl and the MMU notifier callback.
>>>
>>> Can you explain the mmu notifier callback interaction here? See below in
>>> Xe the attribute tree is existing VMA tree (gpuvm).
>>>
>>
>> Let me try to explain, apologies if the description is not fully
>> precise.
>>
>> In the current implementation, the MMU notifier callback interacts with the
>> attr tree only in the munmap path: it removes the corresponding attribute
>> entries from the attr tree so that stale attributes do not persist for
>> freed address space.
>>
> 
> Ah, yes. We reset our attributes upon munmap too. We actually don't do this
> 100% correctly either, and there is a series in flight to fix it [1].
> 
> [1] https://patchwork.freedesktop.org/series/161815/

Hi Matt,

It seems like you are changing the implementation to remove the
attributes on munmap.

Actually, we had an internal discussion about whether the driver needs
to remove the attributes on munmap.

There are several ideas:

1. Keep the attributes: they may be needed again when a new VMA
appears or on subsequent faults.
2. Keep the attributes: they can be set independent of whether
memory is currently mapped; attributes persist and are modified
explicitly via ioctl, not implicitly by notifier callbacks.
3. Remove the attributes: because the VMA is gone, the driver can do
nothing without the VMA.

I also saw that xe_svm set default attributes in a previous version;
that is another option.
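
To make the difference concrete, option 3 would look roughly like this
in our notifier callback (just a sketch; amdgpu_svm_attr_tree_remove_range()
is a placeholder name standing in for the attr tree helpers from patch 04/12):

  /* sketch only -- error paths and locking omitted */
  static void amdgpu_svm_invalidate(struct drm_gpusvm *gpusvm,
                                    struct drm_gpusvm_notifier *notifier,
                                    const struct mmu_notifier_range *mmu_range)
  {
          struct amdgpu_svm *svm = container_of(gpusvm, struct amdgpu_svm,
                                                gpusvm);

          /* ... clear PTEs, unmap DMA pages, flush TLB as today ... */

          if (mmu_range->event == MMU_NOTIFY_UNMAP)
                  /* option 3: attributes do not outlive the VMA */
                  amdgpu_svm_attr_tree_remove_range(svm, mmu_range->start,
                                                    mmu_range->end);
          /* options 1/2: skip the call above and keep the attr tree as is */
  }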

Could you please share some background on why xe_svm is moving to
removing the attributes on munmap? And is keeping the attributes a
valid approach?


Regards,
Honglei

> 
>>>>>>
>>>>>>        - Physical page layer (drm_gpusvm ranges): managed by the
>>>>>>          drm_gpusvm framework, representing actual HMM backed DMA
>>>>>>          mappings and GPU page table entries.
>>>>>>
>>>>>>         This separation is necessary:
>>>>>>           -  The framework does not support range splitting, so a partial
>>>>>>              munmap destroys the entire overlapping range, including the
>>>>>>              still valid parts. If attributes were stored inside drm_gpusvm
>>>>>>              ranges, they would be lost on unmapping.
>>>>>>              The separate attr tree preserves userspace-set attributes
>>>>>>              across range operations.
>>>
>>> Yes, in Xe the divide is at the VMA level (set by user space) via VM
>>> bind (parts of a VM may be mapping BOs, parts could be set up for SVM) or
>>> madvise IOCTLs which reflect user space attributes on current SVM
>>> mappings or future ones.
>>>
>>> The SVM range tree reflects mappings that have been faulted into the
>>> device and contain pages. This is an intentional choice.
>>
>> That makes a lot of sense. Thank you for clarifying the design intent. I
>> think the current implementation adopts the same principle: the drm_gpusvm
>> range tree only reflects actual faulted-in mappings.
>>
>>>
>>>>>
>>>>> Isn't that actually intended? When parts of the range are unmapped, that usually means the whole range isn't valid any more.
>>>
>>>
>>> Yes, this was an intentional design choice to not support partial unmap,
>>> and instead rely on the driver to recreate a new range.
>>>
>>> The reasoning is:
>>>
>>> - In practice, this should be rare for well-behaved applications.
>>>
>>> - With THP / large device pages, if a sub-range is unmapped, the entire
>>> GPU mapping is invalidated anyway due to the page size change. As a
>>> result, the cost of creating a new range is minimal, since the device
>>> will likely fault again on the remaining pages.
>>>
>>> So there is no need to over-engineer the common code.
>>>
>>> FWIW, to even test partial unmaps in Xe, I had to do things I doubt
>>> anyone would ever do:
>>>
>>> ptr = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
>>>            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>> /* fault in memory to the device */
>>> munmap(ptr, SZ_1M);
>>> /* touch memory again on the device */
>>>
>>
>> Thank you for this explanation and the concrete example. After further
>> internal discussion with Christian, we are now aligned on the same position
>> on partial unmap. We will remove the rebuild-on-partial-unmap logic in the
>> next version and handle it as a partially backed range.
>>
>>>>
>>>>
>>>> It is about partial unmap: some subregion of a drm_gpusvm_range is still
>>>> valid while another subregion is invalid, but under drm_gpusvm the
>>>> entire range needs to be destroyed.
>>>>
>>>> e.g.:
>>>>
>>>>             [---------------unmap region in mmu notifier-----------------]
>>>> [0x1000 ------------ 0x9000]
>>>> [  valid ][     invalid    ]
>>>>
>>>> see the detail in drm_gpusvm.c:110,
>>>> section: Partial Unmapping of Ranges
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>           -  drm_gpusvm range boundaries are determined by fault address
>>>>>>              and a preset chunk size, not by userspace attribute boundaries.
>>>>>>              Ranges may be rechunked on memory changes. Embedding
>>>>>>              attributes in framework ranges would scatter attribute state
>>>>>>              across many small ranges and require complex reassembly
>>>>>>              logic when operating on attributes.
>>>>>
>>>>> Yeah, that makes a lot of sense.
>>>>>
>>>>>>
>>>>>>      2) System memory mapping via drm_gpusvm
>>>>>>
>>>>>>         The core mapping path uses drm_gpusvm_range_find_or_insert() to
>>>>>>         create ranges, drm_gpusvm_range_get_pages() for HMM page fault
>>>>>>         and DMA mapping, then updates GPU page tables via
>>>>>>         amdgpu_vm_update_range().
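
[ Adding a sketch of this flow here for reviewers, since it came up
several times in the thread. Error handling and locking are trimmed,
and amdgpu_svm_range_map() is a placeholder for the PTE update loop in
patch 07/12: ]

  struct drm_gpusvm_ctx ctx = {};   /* no migration in this POC */
  struct drm_gpusvm_range *range;
  int err;

  range = drm_gpusvm_range_find_or_insert(&svm->gpusvm, fault_addr,
                                          gpuva_start, gpuva_end, &ctx);
  if (IS_ERR(range))
          return PTR_ERR(range);

  /* hmm_range_fault() + DMA mapping happen under the hood */
  err = drm_gpusvm_range_get_pages(&svm->gpusvm, range, &ctx);
  if (err)
          return err;     /* -EAGAIN: raced with a notifier, retry */

  /* commit PTEs via amdgpu_vm_update_range() under the notifier lock */
  err = amdgpu_svm_range_map(svm, range);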
>>>>>>
>>>>>>      3) IOCTL driven mapping (XNACK off / no GPU fault mode)
>>>>>>
>>>>>>         On XNACK off hardware the GPU cannot recover from page faults,
>>>>>>         so mappings must be established through ioctl. When
>>>>>>         userspace calls SET_ATTR with ACCESS=ENABLE, the driver
>>>>>>         walks the attr tree and maps all accessible intervals
>>>>>>         to the GPU by amdgpu_svm_range_map_attr_ranges().
>>>
>>> Can you expand on XNACK off / GPU no faults? Is this to share the GPU
>>> between 3D (dma-fences) and faulting clients? We have something similar
>>> in Xe, but it isn't an explicit IOCTL; rather, we switch between the two
>>> on demand as a 3D client submits and then resume page faults when all
>>> dma-fences have signaled.
>>>
>>> I see below you mention page tables are modified while KFD queues are
>>> quiesced? I'm not sure that is required - you just need to guarantee
>>> faulting clients won't trigger page faults while a dma-fence is in flight.
>>>
>>> Maybe give me an explanation of exactly what the requirements from AMD
>>> are here so I have a better picture.
>>
>> Thank you for the patience, let me try to explain our situation, though
>> I may not get every detail right.
>>
>> XNACK off means hardware that does not have GPU page fault capability (or
>> has it turned off).
>>
>> So for these GPUs, ALL page table entries must be fully populated before
>> the GPU can access the memory. This is why we need the ioctl-driven
>> mapping path: when userspace calls SET_ATTR with ACCESS=ENABLE, we need to
>> walk the attribute tree and eagerly map all accessible ranges into the
>> GPU page tables. This is functionally similar to what you describe as
>> prefetch IOCTLs / VM bind in Xe.
>>
>> Regarding queue quiesce during page table modification: on XNACK off
>> hardware, because the GPU cannot fault, we must ensure the GPU is
>> completely stopped before modifying any PTE it might be accessing.
>> Otherwise the GPU could access a partially updated page table and hang.
>> The quiesce/resume is the mechanism to guarantee this.
>>
>> I hope that helps clarify the picture.
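
For concreteness, the XNACK off update sequence described above is
conceptually the following (a sketch only; amdgpu_svm_range_map() and
amdgpu_svm_flush_tlb() are placeholder names for the series' helpers):

  /* stop all compute queues so no GPU access is in flight */
  kgd2kfd_quiesce_mm(mm, KFD_QUEUE_EVICTION_TRIGGER_SVM);

  amdgpu_svm_range_map(svm, range);   /* rebuild the PTEs */
  amdgpu_svm_flush_tlb(svm);          /* make the update visible */

  /* only now is it safe to let the queues run again */
  kgd2kfd_resume_mm(mm);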
>>
> 
> This clarifies a lot. This is what we’d call in Xe “preemption fence”
> mode for a VM. Anytime memory is moved, we trigger a GPU preemption and
> resume. We don’t actually support SVM in this case; instead, we use
> “userptr binds,” which are built on gpusvm for page collection. However,
> we don’t support migrating memory to the device—though we could.
> 
> I’d look at how we converted 'userptr' to be based on GPU SVM [2]. In
> this case, don’t maintain a range tree, as those—as you suggest—are more
> of an on-demand fault driver concern. Instead, just embed 'struct
> drm_gpusvm_pages' in the VMA struct defined by the IOCTLs.
> 
> We could extend this to support migrating 'userptr', but we just haven’t
> done that yet—this may be what you want to do in “XNACK off”.
> 
> [2] https://patchwork.freedesktop.org/series/146553/
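
If I read [2] correctly, the shape would be roughly (a sketch; the
amdgpu struct name here is made up):

  /* userptr-style mapping with an embedded page collection, no range tree */
  struct amdgpu_svm_userptr {                  /* hypothetical */
          struct amdgpu_bo_va_mapping mapping; /* the IOCTL-defined VMA */
          struct drm_gpusvm_pages pages;       /* gpusvm page collection */
          struct mmu_interval_notifier notifier;
  };

That does look like a good fit for the XNACK off path.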
> 
>>
>>>
>>>>>>
>>>>>>      4) Invalidation, GC worker, and restore worker
>>>>>>
>>>>>>         MMU notifier callbacks (amdgpu_svm_range_invalidate) handle
>>>>>>         three cases based on event type and hardware mode:
>>>>>>           - unmap event: clear GPU PTEs in the notifier context,
>>>>>>             unmap DMA pages, mark ranges as unmapped, flush TLB,
>>>>>>             and enqueue to the GC worker. On XNACK off, also
>>>>>>             quiesce KFD queues and schedule rebuild of the
>>>>>>             still valid portions that were destroyed together with
>>>>>>             the unmapped subregion.
>>>>>>
>>>>>>           - evict on XNACK off:
>>>>>>             quiesce KFD queues first, then unmap DMA pages and
>>>>>>             enqueue to the restore worker.
>>>>>
>>>>> Is that done through the DMA fence or by talking directly to the MES/HWS?
>>>>
>>>> Currently the KFD queue quiesce/resume APIs are reused; looking forward
>>>> to a better solution.
>>>>
>>>
>>> +1
>>>
>>>> Regards,
>>>> Honglei
>>>>
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>           - evict on XNACK on:
>>>>>>             clear GPU PTEs, unmap DMA pages, and flush TLB, but do
>>>>>>             not schedule any worker. The GPU will fault on next
>>>>>>             access and the fault handler establishes the mapping.
>>>>>>
>>>>>> Features not yet supported:
>>>>>>      - XNACK on GPU page fault mode
>>>>>>      - migration and prefetch feature
>>>>>>      - Multi GPU support
>>>>>>
>>>>>>      XNACK on enablement is ongoing. The GPUs that support XNACK on
>>>>>>      are currently only accessible to us via remote lab machines, which slows
>>>>>>      down progress.
>>>>>>
>>>>>> Patch overview:
>>>>>>
>>>>>>      01/12 UAPI definitions: DRM_AMDGPU_GEM_SVM ioctl, SVM flags,
>>>>>>            SET_ATTR/GET_ATTR operations, attribute types, and related
>>>>>>            structs in amdgpu_drm.h.
>>>>>>
>>>>>>      02/12 Core data structures: amdgpu_svm wrapping drm_gpusvm with
>>>>>>            refcount, attr_tree, workqueues, locks, and
>>>>>>            callbacks (begin/end_restore, flush_tlb).
>>>>>>
>>>>>>      03/12 Attribute data structures: amdgpu_svm_attrs, attr_range
>>>>>>            (interval tree node), attr_tree, access enum, flag masks,
>>>>>>            and change trigger enum.
>>>>>>
>>>>>>      04/12 Attribute tree operations: interval tree lookup, insert,
>>>>>>            remove, and tree create/destroy lifecycle.
>>>>>>
>>>>>>      05/12 Attribute set: validate UAPI attributes, apply to internal
>>>>>>            attrs, handle hole/existing range with head/tail splitting,
>>>>>>            compute change triggers, and -EAGAIN retry loop.
>>>>>>            Implements attr_clear_pages for unmap cleanup and attr_get.
>>>>>>
>>>>>>      06/12 Range data structures: amdgpu_svm_range extending
>>>>>>            drm_gpusvm_range with gpu_mapped state, pending ops,
>>>>>>            pte_flags cache, and GC/restore queue linkage.
>>>>>>
>>>>>>      07/12 PTE flags and GPU mapping: simple gpu pte function,
>>>>>>            GPU page table update with DMA address, range mapping loop:
>>>>>>            find_or_insert -> get_pages -> validate -> update PTE,
>>>>>>            and attribute change driven mapping function.
>>>>>>
>>>>>>      08/12 Notifier and invalidation: synchronous GPU PTE clear in
>>>>>>            notifier context, range removal and overlap cleanup,
>>>>>>            rebuild-after-destroy logic, and the MMU event dispatcher.
>>>>>>
>>>>>>      09/12 Workers: KFD queue quiesce/resume via kgd2kfd APIs, GC
>>>>>>            worker for unmap processing and rebuild, ordered restore
>>>>>>            worker for mapping evicted ranges, and flush/sync
>>>>>>            helpers.
>>>>>>
>>>>>>      10/12 Initialization and fini: kmem_cache for range/attr,
>>>>>>            drm_gpusvm_init with chunk sizes, XNACK detection, TLB
>>>>>>            flush helper, and amdgpu_svm init/close/fini lifecycle.
>>>>>>
>>>>>>      11/12 IOCTL and fault handler: PASID based SVM lookup with kref
>>>>>>            protection, amdgpu_gem_svm_ioctl dispatcher, and
>>>>>>            amdgpu_svm_handle_fault for GPU page fault recovery.
>>>>>>
>>>>>>      12/12 Build integration: Kconfig option (CONFIG_DRM_AMDGPU_SVM),
>>>>>>            Makefile rules, ioctl table registration, and amdgpu_vm
>>>>>>            hooks (init in make_compute, close/fini, fault dispatch).
>>>>>>
>>>>>> Test result:
>>>>>>      on gfx1100 (W7900) and gfx943 (MI300X)
>>>>>>      kfd test: 95%+ passed, same failing cases as the official release
>>>>>>      rocr test: all passed
>>>>>>      hip catch test: 20 of 5366 cases failed, +13 failures vs the official release
>>>>>>
>>>>>> During implementation we identified several challenges / design questions:
>>>>>>
>>>>>> 1. No range splitting on partial unmap
>>>>>>
>>>>>>      drm_gpusvm explicitly does not support range splitting (drm_gpusvm.c:122).
>>>>>>      A partial munmap needs to destroy the entire range, including the valid interval.
>>>>>>      GPU-fault-driven hardware can handle this design with extra GPU fault handling,
>>>>>>      but AMDGPU needs to support XNACK off hardware, so this design requires the
>>>>>>      driver to rebuild the valid part of the removed range. This brings very heavy
>>>>>>      restore work to the work queue/GC worker: unmap/destroy -> rebuild (insert and map).
>>>>>>      This restore work is even heavier than kfd_svm's: previously the driver work
>>>>>>      queue only needed to restore or unmap, but with drm_gpusvm the driver needs to
>>>>>>      unmap and restore, which brings more complex logic, a heavier work queue
>>>>>>      load, and synchronization issues.
>>>
>>> Is this common in the workloads you are running? I'm also wondering if
>>> your restore logic / KFD's design is actually the problem here.
>>>
>>
>> Honestly, you raise a fair point.
>>
>> We will redesign the logic around partial munmap, which should eliminate
>> most of this complexity.
>>
>>
> 
> +1, yes - test it, but don't optimize for it.
> 
>>>>>>
>>>>>> 2. Fault driven vs ioctl driven mapping
>>>>>>
>>>>>>      drm_gpusvm is designed around GPU page fault handlers. The primary entry
>>>>>>      point drm_gpusvm_range_find_or_insert() takes a fault_addr.
>>>>>>      AMDGPU needs to support ioctl-driven mapping because on no-XNACK
>>>>>>      hardware the GPU cannot fault at all.
>>>
>>> I think we refer to these as prefetch IOCTLs in Xe. Ideally, user space
>>> issues these so the device does not fault (e.g., prefetch creates a set
>>> of SVM ranges based on user input). In Xe, prefetch IOCTLs are simply
>>> specific VM bind operations.
>>>
>>
>> That is a very helpful way to think about it. Yes, our ioctl-driven
>> mapping (XNACK off) is essentially equivalent to a prefetch operation. We
>> are trying to improve it.
>>
> 
> See above wrt 'userptr'.
> 
>>
>>>>>>
>>>>>>      The ioctl path cannot hold mmap_read_lock across the entire operation
>>>>>>      because drm_gpusvm_range_find_or_insert() acquires/releases it
>>>>>>      internally. This creates race windows with MMU notifiers / workers.
>>>
>>> This is a very intentional choice in the locking design: mmap_read_lock
>>> is held only in very specific parts of GPU SVM, and the driver should
>>> never need to take this lock.
>>>
>>> Yes, notifiers can race, which is why the GPU fault handler and prefetch
>>> handler are structured as retry loops when a notifier race is detected.
>>> In practice, with well-behaved applications, these races should be
>>> rare—but they do occur, and the driver must handle them.
>>>
>>> __xe_svm_handle_pagefault implements the page fault retry loop. VM bind
>>> prefetch has similar logic, although it is more spread out given that it
>>> is part of a deeper software pipeline.
>>>
>>> FWIW, holding locks to avoid races was rejected by Sima because we
>>> reasoned it is essentially impossible to guarantee the absence of races
>>> by holding a lock. CPU page fault handlers are also effectively just
>>> large retry loops.
>>>
>>> So this is one point I believe you will need to fixup driver side.
>>>
>>
>> Understood. Thank you for the detailed explanation and for pointing to
>> __xe_svm_handle_pagefault as a reference. We will restructure both our
>> fault handler and ioctl path into a proper retry loop pattern with
>> sequence number race detection.
>>
> 
> Yes, the typical pattern is:
> 
> - Try to migrate once
> - If you hit a race, give up, evict all memory back to system memory, and bind it
> 
> Atomics make this tricky because memory must move, but I’m not sure
> “XNACK off” applies here. However, GPU SVM provides a timeslice
> mechanism to ensure the CPU can’t move memory while the GPU needs to
> execute something.
> 
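
To make sure I understand the pattern, the fault handler we plan would
be shaped like this (a sketch; amdgpu_svm_fault_once() is a placeholder
for the collect-pages-and-bind step):

  int amdgpu_svm_handle_fault(struct amdgpu_svm *svm, u64 fault_addr)
  {
          /* timeslice: CPU can't move memory right after a GPU fault */
          struct drm_gpusvm_ctx ctx = { .timeslice_ms = 5 };
          int err;

  retry:
          err = amdgpu_svm_fault_once(svm, fault_addr, &ctx);
          if (err == -EAGAIN)     /* raced with an MMU notifier */
                  goto retry;     /* re-collect pages and bind again */
          return err;
  }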
>>>>>>
>>>>>> 3. Multi GPU support
>>>>>>
>>>>>> drm_gpusvm binds one drm_device to one instance. In multi GPU systems,
>>>>>> each GPU gets an independent instance with its own range tree, MMU
>>>>>> notifiers, notifier_lock, and DMA mappings.
>>>>>>
>>>
>>> This is a part I am absolutely open to fixing. Right now, each
>>> drm_gpusvm_range has a single set of drm_gpusvm_pages. I am open to
>>> decoupling a GPU SVM instance from a single device, allowing each
>>> drm_gpusvm_range to have multiple sets of drm_gpusvm_pages (one per
>>> device).
>>>
>>> This would give drivers the flexibility to use one GPU SVM instance per
>>> VM/device instance (as in Xe), or to maintain a single GPU SVM per CPU
>>> MM.
>>>
>>
>> That would be wonderful! Looking forward to your patch very much!
>>
> 
> I can't say I'll code this, but we thought about it as an option and are
> very open to patches which refactor the object model for multiple use cases.
> 
>>
>>>>>> This may bring huge overhead:
>>>>>>        - N x MMU notifier registrations for the same address range
>>>
>>> The notifier overhead is a real concern. We recently introduced two-pass
>>> notifiers [1] to speed up multi-device notifiers. At least in Xe, the
>>> TLB invalidations—which are the truly expensive part—can be pipelined
>>> using the two-pass approach. Currently, [1] only implements two-pass
>>> notifiers for userptr, but Xe’s GPU SVM will be updated to use them
>>> shortly.
>>>
>>> [1] https://patchwork.freedesktop.org/series/153280/
>>>
>>
>> Thank you for the pointer to two-pass notifiers. Will study this
>> series.
>>
>>>>>>        - N x hmm_range_fault() calls for the same page (KFD: 1x)
>>>
>>> hmm_range_fault is extremely fast compared to the actual migration.
>>> Running hmm_range_fault on a 2MB region using 4KB pages takes less
>>> than 1µs. With THP or large device pages [2] (merged last week), it’s
>>> around 1/20 of a microsecond. So I wouldn’t be too concerned about this.
>>>
>>> [2] https://patchwork.freedesktop.org/series/163141/
>>>
>>
>> That is very helpful data. Perhaps I worry too much.
>>
>>>>>>        - N x DMA mapping memory
>>>
>>> You will always have N x DMA mapping memory if the pages are in system
>>> memory as the dma-mapping API is per device.
>>
>> Totally agreed.
>>
>>>
>>>>>>        - N x invalidation + restore worker scheduling per CPU unmap event
>>>>>>        - N x GPU page table flush / TLB invalidation
>>>
>>> I agree you do not want to serialize GPU page table flushes / TLB
>>> invalidations. Hence two-pass notifiers [1].
>>
>> Yes, will study it.
>>
>>>
>>>>>>        - Increased mmap_lock hold time, N callbacks serialize under it
>>>>>>
>>>>>> Compatibility issues:
>>>>>>        - Quiesce/resume scope mismatch: to integrate with KFD compute
>>>>>>          queues, the driver reuses kgd2kfd_quiesce_mm()/resume_mm(),
>>>>>>          which have process-level semantics. Under the per-GPU
>>>>>>          drm_gpusvm model there may be some synchronization issues. To
>>>>>>          properly integrate with KFD under the per-SVM model, compatibility
>>>>>>          shims or new per-VM queue control APIs may need to be introduced.
>>>>>>
>>>
>>> I thought the idea was to get rid of KFD and move over to AMDGPU? I thought
>>> Christian mentioned this to me at XDC.
>>>
>>
>>>>>> Migration challenges:
>>>>>>
>>>>>>      - No global migration decision logic: each per GPU SVM
>>>>>>        instance maintains its own attribute tree independently. This
>>>>>>        allows conflicting settings (e.g., GPU0's SVM sets
>>>>>>        PREFERRED_LOC=GPU0 while GPU1's SVM sets PREFERRED_LOC=GPU1
>>>>>>        for the same address range) with no detection or resolution.
>>>>>>        A global attribute coordinator or a shared manager is needed to
>>>>>>        provide a unified global view for migration decisions.
>>>
>>> Yes, this is a hole in the Xe API too. We have told UMDs that if they set
>>> up individual VMs with conflicting attributes for a single CPU address
>>> space, the behavior is undefined. Our UMD's madvise implementation
>>> basically loops over all GPU VMs setting the same attributes.
>>
>> Will follow the same approach for now: the UMD is responsible for setting
>> consistent attributes across GPU VMs.
>>
> 
> +1
> 
>>>
>>>>>>
>>>>>>      - migrate_vma_setup broadcast: one GPU's migration triggers MMU
>>>>>>        notifier callbacks in ALL N-1 other drm_gpusvm instances,
>>>>>>        causing N-1 unnecessary restore workers to be scheduled. And
>>>
>>> My feeling is that you shouldn’t reschedule restore workers unless you
>>> actually have to invalidate page tables (i.e., you have a local SVM
>>> range within the notifier). So the first migration to an untouched
>>> region may trigger notifiers, but they won’t do anything because you
>>> don’t have any valid SVM ranges yet. Subsequent mappings of the migrated
>>> region won’t trigger a notifier unless the memory is moved again.
>>>
>>
>> That is a very good point. We should check whether we actually have
>> valid SVM ranges before scheduling restore workers. If there is nothing
>> to invalidate, the notifier callback should be a no-op. We will review
>> our notifier callback logic to ensure we are not doing unnecessary work
>> here. Thank you for pointing this out.
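
Concretely, something like this at the top of our invalidate callback
(a sketch; drm_gpusvm_range_find() looks like the right helper, if I
read the API correctly):

  /* no faulted-in ranges under this notifier -> nothing to invalidate */
  if (!drm_gpusvm_range_find(notifier, mmu_range->start, mmu_range->end))
          return;   /* no PTEs to clear, no worker to schedule */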
>>
>>>>>>        creates races between the initiating migration and the other
>>>>>>        instance's restore attempts.
>>>
>>> Yes, if multiple devices try to migrate the same CPU pages at the same
>>> time, that will race. That’s why in Xe we have a module-level
>>> driver_migrate_lock. The first migration runs in read mode; if it
>>> detects a race and aborts, it then takes driver_migrate_lock in write
>>> mode so it becomes the only device allowed to move memory / CPU pages.
>>> See xe_svm_alloc_vram() for how this is used.
>>>
>>> I’m not sure this approach will work for you, but I just wanted to point
>>> out that we identified this as a potential issue.
>>>
>>
>> Thank you for sharing the driver_migrate_lock approach and pointing to
>> xe_svm_alloc_vram(). Will explore whether a similar lock pattern can work
>> for our case.
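
If we adopt it, the shape would roughly follow xe_svm_alloc_vram()
(a sketch; the amdgpu names are hypothetical):

  /* module-level lock: migrate in read mode, retry as the sole writer */
  static DECLARE_RWSEM(amdgpu_svm_migrate_lock);

  down_read(&amdgpu_svm_migrate_lock);
  err = amdgpu_svm_migrate_to_vram(svm, start, end);  /* placeholder */
  up_read(&amdgpu_svm_migrate_lock);
  if (err == -EBUSY) {
          /* lost a cross-device race: become the only migrator */
          down_write(&amdgpu_svm_migrate_lock);
          err = amdgpu_svm_migrate_to_vram(svm, start, end);
          up_write(&amdgpu_svm_migrate_lock);
  }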
>>
>>>>>>
>>>>>>      - No cross instance migration serialization: each per GPU
>>>>>>        drm_gpusvm instance has independent locking, so two GPUs'
>>>>>>        "decide -> migrate -> remap" sequences can interleave. While
>>>>>>        the kernel page lock prevents truly simultaneous migration of
>>>>>>        the same physical page, the losing side's retry (evict from
>>>>>>        other GPU's VRAM -> migrate back) triggers broadcast notifier
>>>>>>        invalidations and restore workers, compounding the ping pong
>>>>>>        problem above.
>>>>>>
>>>
>>> See the driver_migrate_lock above.
>>
>> Acknowledged, thank you.
>>>
>>>>>>      - No VRAM to VRAM migration: drm_pagemap_migrate_to_devmem()
>>>>>>        hardcodes MIGRATE_VMA_SELECT_SYSTEM (drm_pagemap.c:328), meaning
>>>>>>        it only selects system memory pages for migration.
>>>>>>
>>>
>>> I think this is fixed? We did find some core MM bugs that blocked VRAM
>>> to VRAM but those have been worked out.
>>>
>>> The code I'm looking at:
>>>
>>>    517 int drm_pagemap_migrate_to_devmem(struct drm_pagemap_devmem *devmem_allocation,
>>>    518                                   struct mm_struct *mm,
>>>    519                                   unsigned long start, unsigned long end,
>>>    520                                   const struct drm_pagemap_migrate_details *mdetails)
>>>    521 {
>>>    522         const struct drm_pagemap_devmem_ops *ops = devmem_allocation->ops;
>>>    523         struct drm_pagemap *dpagemap = devmem_allocation->dpagemap;
>>>    524         struct dev_pagemap *pagemap = dpagemap->pagemap;
>>>    525         struct migrate_vma migrate = {
>>>    526                 .start          = start,
>>>    527                 .end            = end,
>>>    528                 .pgmap_owner    = pagemap->owner,
>>>    529                 .flags          = MIGRATE_VMA_SELECT_SYSTEM | MIGRATE_VMA_SELECT_DEVICE_COHERENT |
>>>    530                 MIGRATE_VMA_SELECT_DEVICE_PRIVATE | MIGRATE_VMA_SELECT_COMPOUND,
>>>    531         };
>>>
>>
>> Thank you for checking! I am using v6.18 for this POC and missed the fix;
>> will rebase to the latest.
>>
>>
>>>>>>      - CPU fault reverse migration race: CPU page fault triggers
>>>>>>        migrate_to_ram while GPU instances are concurrently operating.
>>>>>>        Per GPU notifier_lock does not protect cross GPU operations.
>>>
>>> No, again retry loop as discussed above.
>>
>> Understood.
>>
>>>
>>>>>>
>>>>>> We believe a strong, well-designed solution at the framework level is
>>>>>> needed to properly address these problems, and we look forward to
>>>>>> discussion and suggestions.
>>>
>>> Let's work together to figure out what is missing here.
>>
>> Thank you so much, Matt. Your feedback has been incredibly valuable and
>> has given us a much clearer picture of the framework's design.
>> I really appreciate the effort you put into building drm_gpusvm as a
>> shared framework. Will incorporate your suggestions into our next
>> revision and look forward to continuing the collaboration.
>>
> 
> No problem. Happy to help.
> 
> Matt
> 
>> Regards,
>> Honglei
>>
>>
>>>
>>> Matt
>>>
>>>>>>
>>>>>> Honglei Huang (12):
>>>>>>      drm/amdgpu: add SVM UAPI definitions
>>>>>>      drm/amdgpu: add SVM data structures and header
>>>>>>      drm/amdgpu: add SVM attribute data structures
>>>>>>      drm/amdgpu: implement SVM attribute tree operations
>>>>>>      drm/amdgpu: implement SVM attribute set
>>>>>>      drm/amdgpu: add SVM range data structures
>>>>>>      drm/amdgpu: implement SVM range PTE flags and GPU mapping
>>>>>>      drm/amdgpu: implement SVM range notifier and invalidation
>>>>>>      drm/amdgpu: implement SVM range workers
>>>>>>      drm/amdgpu: implement SVM core initialization and fini
>>>>>>      drm/amdgpu: implement SVM ioctl and fault handler
>>>>>>      drm/amdgpu: wire up SVM build system and fault handler
>>>>>>
>>>>>>     drivers/gpu/drm/amd/amdgpu/Kconfig            |   11 +
>>>>>>     drivers/gpu/drm/amd/amdgpu/Makefile           |   13 +
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c       |    2 +
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c       |  430 ++++++
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h       |  147 ++
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c  |  894 ++++++++++++
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h  |  110 ++
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c | 1196 +++++++++++++++++
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h |   76 ++
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   40 +-
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |    4 +
>>>>>>     include/uapi/drm/amdgpu_drm.h                 |   39 +
>>>>>>     12 files changed, 2958 insertions(+), 4 deletions(-)
>>>>>>     create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.c
>>>>>>     create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm.h
>>>>>>     create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.c
>>>>>>     create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_attr.h
>>>>>>     create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.c
>>>>>>     create mode 100644 drivers/gpu/drm/amd/amdgpu/amdgpu_svm_range.h
>>>>>>
>>>>>>
>>>>>> base-commit: 7d0a66e4bb9081d75c82ec4957c50034cb0ea449
>>>>>
>>>>
>>



Thread overview: 36+ messages
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 01/12] drm/amdgpu: add SVM UAPI definitions Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 02/12] drm/amdgpu: add SVM data structures and header Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 03/12] drm/amdgpu: add SVM attribute data structures Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 04/12] drm/amdgpu: implement SVM attribute tree operations Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 05/12] drm/amdgpu: implement SVM attribute set Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 06/12] drm/amdgpu: add SVM range data structures Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 07/12] drm/amdgpu: implement SVM range PTE flags and GPU mapping Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 08/12] drm/amdgpu: implement SVM range notifier and invalidation Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 09/12] drm/amdgpu: implement SVM range workers Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 10/12] drm/amdgpu: implement SVM core initialization and fini Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 11/12] drm/amdgpu: implement SVM ioctl and fault handler Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 12/12] drm/amdgpu: wire up SVM build system " Honglei Huang
2026-03-17 11:48 ` [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Christian König
2026-03-18  8:59   ` Honglei Huang
2026-03-19  5:08     ` Matthew Brost
2026-03-19 14:17       ` Honglei Huang
2026-03-23  6:31         ` Matthew Brost
2026-03-24  7:24           ` Honglei Huang
2026-03-25 22:24             ` Matthew Brost
2026-03-26 12:16               ` Honglei Huang
2026-04-15 10:04                 ` Huang, Honglei1
2026-04-23  6:40                   ` Matthew Brost
2026-04-23  7:18                     ` Matthew Brost
2026-04-23 11:03                       ` Huang, Honglei1
2026-04-23 20:21                         ` Matthew Brost
2026-04-24 10:43                           ` Huang, Honglei1
2026-04-27 20:00                             ` Felix Kuehling
2026-04-28  2:23                               ` Huang, Honglei1
2026-04-30  3:04                                 ` Matthew Brost
2026-04-23  6:09           ` Huang, Honglei1 [this message]
2026-04-23  6:52             ` Matthew Brost
2026-04-23  8:22               ` Huang, Honglei1
2026-04-29  9:56                 ` Huang, Honglei1
2026-04-30  2:56                   ` Huang, Honglei1
2026-04-30  3:12                     ` Matthew Brost
