From: "Huang, Honglei1" <honghuan@amd.com>
To: "Felix Kuehling" <felix.kuehling@amd.com>,
"Matthew Brost" <matthew.brost@intel.com>,
"Christian König" <christian.koenig@amd.com>
Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
Alexander.Deucher@amd.com,
"Honglei Huang" <honglei1.huang@amd.com>,
Oak.Zeng@amd.com, Jenny-Jing.Liu@amd.com, Philip.Yang@amd.com,
Xiaogang.Chen@amd.com, Ray.Huang@amd.com, Lingshan.Zhu@amd.com,
Junhua.Shen@amd.com,
"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
"Danilo Krummrich" <dakr@kernel.org>,
"Alice Ryhl" <aliceryhl@google.com>
Subject: Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
Date: Tue, 28 Apr 2026 10:23:18 +0800 [thread overview]
Message-ID: <9b4e3f7c-5d32-4b7b-a270-703325926f86@amd.com> (raw)
In-Reply-To: <66a7efd4-8ec9-47af-b6c4-5be25459a474@amd.com>
On 4/28/2026 4:00 AM, Felix Kuehling wrote:
>
> On 2026-04-24 06:43, Huang, Honglei1 wrote:
>>
>>
>> On 4/24/2026 4:21 AM, Matthew Brost wrote:
>>> On Thu, Apr 23, 2026 at 07:03:52PM +0800, Huang, Honglei1 wrote:
>>>>
>>>>
>>>> On 4/23/2026 3:18 PM, Matthew Brost wrote:
>>>> ...
>>>>>>>>>>> This clarifies a lot. This is what we’d call in Xe
>>>>>>>>>>> “preemption fence”
>>>>>>>>>>> mode for a VM. Anytime memory is moved, we trigger a GPU
>>>>>>>>>>> preemption and
>>>>>>>>>>> resume. We don’t actually support SVM in this case; instead,
>>>>>>>>>>> we use
>>>>>>>>>>> “userptr binds,” which are built on gpusvm for page
>>>>>>>>>>> collection. However,
>>>>>>>>>>> we don’t support migrating memory to the device—though we could.
>>>>>>>>>>>
>>>>>>>>>>> I’d look at how we converted 'userptr' to be based on GPU SVM
>>>>>>>>>>> [2]. In
>>>>>>>>>>> this case, don’t maintain a range tree, as those—as you
>>>>>>>>>>> suggest—are more
>>>>>>>>>>> of an on-demand fault driver concern. Instead, just embed
>>>>>>>>>>> 'struct
>>>>>>>>>>> drm_gpusvm_pages' in the VMA struct defined by the IOCTLs.
>>>>>>>>>>>
>>>>>>>>>>> We could extend this to support migrating 'userptr', but we
>>>>>>>>>>> just haven’t
>>>>>>>>>>> done that yet—this may be what you want to do in “XNACK off.”
>>>>>>>>>>>
>>>>>>>>>>> [2] https://patchwork.freedesktop.org/series/146553/
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Actually we need to switch the XNACK mode between on and off;
>>>>>>>>>> in XNACK-off mode the driver operates in "implicit prefetch
>>>>>>>>>> mode". This may be due to compatibility with older hardware and
>>>>>>>>>> the needs of the UMD runtime. We will further discuss the
>>>>>>>>>> handling method under XNACK off internally.
>>>>>>>>>>
>>>>>>>
>>>>>>> Hi Matt,
>>>>>>>
>>>>>>> I studied the xe_userptr code and the conversion series [2] you
>>>>>>> pointed to.
>>>>>>>
>>>>>>> I have a question: would it be possible to reuse drm_gpusvm_range
>>>>>>> to handle hardware without the GPU fault feature (XNACK-off mode)?
>>>>>>
>>>>>> That’s not how we’ve done it. We embedded drm_gpusvm_pages into
>>>>>> our VMA
>>>>>> structure and then attached a notifier. The notifier attachment is
>>>>>> open-coded on the Xe side, and this could be normalized and opened up
>>>>>> for common driver use cases.
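
To make sure I read this right, the embedding described here is
something like the sketch below. The struct and field names are
invented for illustration, loosely modeled on xe_userptr from series
[2], and assume the drm_gpusvm_pages split from that series:

  #include <linux/mmu_notifier.h>
  #include <drm/drm_gpusvm.h>

  /* Illustrative only: struct and field names are hypothetical. */
  struct example_userptr_vma {
          unsigned long start, end;              /* VA interval from the IOCTL */
          struct mmu_interval_notifier notifier; /* open-coded attachment */
          struct drm_gpusvm_pages pages;         /* embedded page collection */
  };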
>>>>
>>>> The approach in xe_userptr is like the implementation in kfd_svm:
>>>> embed the physical pages into a structure and attach a notifier of
>>>> the same size.
>>>> But kfd_svm is an implementation of SVM semantics: it supports
>>>> partial unmap, doesn't need an explicit delete-userptr ioctl call on
>>>> removal, and doesn't need an explicit userptr flag at creation.
>>>> And there is actually already an existing implementation of userptr
>>>> semantics in amdgpu kfd: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR.
>>>> If the no-GPU-fault mode cannot use the drm_gpusvm framework and has
>>>> to follow the xe_userptr approach instead, it seems like duplicating
>>>> that work.
>>>>
>>>> I think the core gap is that we are trying to use drm_gpusvm to
>>>> implement SVM semantics for hardware without GPU faults, rather than
>>>> userptr semantics.
>>>>
>>>>>>
>>>>>> The problem with reusing drm_gpusvm_range directly is that a VMA may
>>>>>> span multiple gpusvm notifiers—i.e., it can be larger than the
>>>>>> notifier
>>>>>> size. Of course, we could rework this as well.
>>>>
>>>> Regarding the "VMA spans multiple gpusvm notifiers" concern: I'd
>>>> like to clarify that this is not actually a blocker for amdgpu's
>>>> XNACK-off path, because amdgpu does not try to represent one user
>>>> ioctl virtual address interval as a single drm_gpusvm_range.
>>>>
>>>> We walk the attr interval and call drm_gpusvm_range_find_or_insert()
>>>> repeatedly, letting gpusvm pick chunk-aligned ranges bounded by
>>>> notifier_size. One ioctl interval will create N chunk-sized ranges.
>>>>
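
In code form, the walk described above is roughly the following minimal
sketch (error handling trimmed; it assumes the
drm_gpusvm_range_find_or_insert() and drm_gpusvm_range_end() helpers
from drm_gpusvm.h):

  #include <linux/err.h>
  #include <drm/drm_gpusvm.h>

  /* Minimal sketch of the interval walk: one ioctl interval becomes N
   * chunk-sized ranges picked by gpusvm. */
  static int svm_insert_interval(struct drm_gpusvm *gpusvm,
                                 unsigned long start, unsigned long end,
                                 const struct drm_gpusvm_ctx *ctx)
  {
          unsigned long addr = start;
          struct drm_gpusvm_range *range;

          while (addr < end) {
                  /* gpusvm picks a chunk-aligned range bounded by
                   * notifier_size */
                  range = drm_gpusvm_range_find_or_insert(gpusvm, addr,
                                                          start, end, ctx);
                  if (IS_ERR(range))
                          return PTR_ERR(range);

                  addr = drm_gpusvm_range_end(range);
          }

          return 0;
  }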
>>>>>>
>>>>>
>>>>> Sorry for the double reply—I just glanced at the latest series. I
>>>>> don’t
>>>>> think creating a range per page of the userptr is desirable. While it
>>>>> would work, from a time-complexity point of view I don’t think this is
>>>>> ideal.
>>>>>
>>>>> The issue with spans across multiple notifiers is real, though.
>>>>>
>>>>> My rough idea would be:
>>>>>
>>>>> - Give drivers an interface to create larger ranges.
>>>>
>>>> So maybe we do not need to create larger ranges if we call
>>>> drm_gpusvm_range_find_or_insert() repeatedly.
>>>>
>>>
>>> That will be functional, but consider it from a time-complexity point of
>>> view.
>>>
>>> Multiple ranges increase the time complexity of range-tree searches.
>>> This isn’t a huge deal, but it will show up to some extent.
>>>
>>> Multiple ranges will also slow down DMA mapping and migration. We
>>> switched over to the dma_iova_alloc/link/unlink/sync API here [1].
>>> While dma_iova_link is a relatively fast radix-tree walk, the allocation
>>> and sync steps are where things get expensive. Therefore, it is
>>> advantageous to perform these steps as few times as possible. For
>>> example, if your SVM buffer is 512MB, instead of doing these steps 256
>>> times, you do them once. The same logic applies to the migrate_vma_*
>>> functions—they are quite expensive, so doing them in a single shot is
>>> significantly faster.
>>>
>>> The same applies to invalidations. If you can invalidate a large range
>>> in a single shot, it will be faster. Although the logic in the notifier
>>> should be able to zap multiple ranges in one shot (Xe does this), having
>>> to DMA-unmap a single large range will still be faster than multiple
>>> smaller DMA unmaps.
>>>
>>> The TL;DR is: if your driver knows the size of the SVM allocation
>>> upfront (e.g., an IOCTL tells you the size), it makes more sense to
>>> use a single large struct (either embed drm_gpusvm_pages into a VMA,
>>> or figure out an interface to insert large ranges / notifiers).
>>>
>>> [1] https://patchwork.freedesktop.org/series/160587/
>>>
>>>>>
>>>>> - If the range fits inside a single notifier’s size → done.
>>>>>
>>>>> - If the range spans multiple notifier sizes → round up to a power of
>>>>> two and create a larger notifier. This may overlap with existing
>>>>> notifiers, which is likely fine given that interval trees support
>>>>> overlaps (?). We’d need to double-check and test this. If
>>>>> overlapping
>>>>> notifiers are not acceptable, we’d need some heavy-handed
>>>>> notifier merge
>>>>> logic—it will be complicated, but isolated, so once we get it
>>>>> right
>>>>> everyone can use it.
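
If I follow the round-up idea, the sizing rule would be something like
the sketch below (pick_notifier_size() is a made-up helper name):

  #include <linux/log2.h>

  static unsigned long pick_notifier_size(unsigned long range_size,
                                          unsigned long notifier_size)
  {
          /* Fits inside a single notifier's size -> done. */
          if (range_size <= notifier_size)
                  return notifier_size;

          /* Spans multiple notifier sizes: round up to a power of two
           * and create one larger notifier. */
          return roundup_pow_of_two(range_size);
  }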
>>>>
>>>> If we call drm_gpusvm_range_find_or_insert() repeatedly, drm_gpusvm
>>>> will create the corresponding notifiers correctly, as far as I can
>>>> see.
>>>>
>>>
>>> I agree this will be functional but not ideal. You can always start the
>>> approach you have here and optimize it later by adding the required
>>> support in GPU SVM.
>>>
>>
>> Hi Matt,
>>
>> Thanks a lot for the information, it really helps!
>>
>>
>> Hi Christian, Felix,
>>
>> According to the discussion with Matt in the previous thread, I'd like
>> to align with you on the XNACK-off direction before starting the series.
>>
>> According to the information from Matt: when the allocation size is
>> known, doing one big operation is significantly faster than doing many
>> small ranges, because the allocation and sync steps are where things
>> get expensive. Doing them in a single shot is significantly faster,
>> especially for XNACK-off mode, which needs to pre-fault and pre-map in
>> the ioctl and where the size is known.
>>
>> Matt confirmed that repeatedly calling
>> drm_gpusvm_range_find_or_insert() is functional, and suggested we land
>> it first and optimize later by adding large-range support in the GPU
>> SVM core. That motivates the two-phase plan below.
>>
>> Phase 1:
>> - Reuse drm_gpusvm_range for XNACK-off: one ioctl interval is split by
>> drm_gpusvm_range_find_or_insert() into N chunk-sized ranges bounded
>> by notifier_size, the same mechanism as the fault path.
>> - Populate all ranges at ioctl/submit time instead of on fault.
>> - Invalidation -> GPU queue stop -> rebind/restore the pages and GPU
>> mappings -> restore queue.
>>
>> Phase 2:
>> Add a large-range / large-notifier insert interface in the GPU SVM
>> core so one ioctl interval maps to a single range, to improve
>> efficiency. This requires modifying the drm_gpusvm framework.
>>
>> May I know your thoughts on this plan?
>
> I think drm_gpusvm_range_find_or_insert already has all the parameters
> necessary to allocate larger notifiers and ranges. All it would take is
> maybe adding a flag in drm_gpusvm_ctx to request larger range allocation
> instead of arbitrary chunking.
>
> I agree this could be done as a second phase and is mostly work in the
> drm_gpusvm code.
Thanks a lot for the reply; I will implement the large-range feature
according to your suggestion.
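
Concretely, what I picture for the phase 2 change is roughly the sketch
below. The 'whole_range' flag is an invented name for the
drm_gpusvm_ctx addition Felix suggests, and the real layout may differ:

  #include <linux/types.h>

  /* Hypothetical chunk-size decision, assuming a 'whole_range' flag
   * were added to struct drm_gpusvm_ctx. */
  static unsigned long svm_chunk_size(bool whole_range,
                                      unsigned long gpuva_start,
                                      unsigned long gpuva_end,
                                      unsigned long default_chunk)
  {
          if (whole_range)
                  return gpuva_end - gpuva_start; /* one range per interval */

          return default_chunk;                   /* existing chunking */
  }
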
Regards,
Honglei
>
> Regards,
> Felix
>
>
>>
>> Regards,
>> Honglei
>>
>>
>>> Matt
>>>
>>>> Regards,
>>>> Honglei
>>>>
>>>>>
>>>>> - Finally, make sure that individual userptr pages can reside at any
>>>>> location.
>>>>>
>>>>> Or conversely:
>>>>>
>>>>> - Normalize embedding of drm_gpusvm_pages in VMA structs + notifier
>>>>> creation
>>>>>
>>>>> - Make sure that individual userptr pages can reside at any location.
>>>>
>>>>>
>>>>> Both options actually sound really similar after typing this out.
>>>>>
>>>>> Matt
>>>>>
>>>>>> So either way, the Xe userptr + gpusvm implementation should be
>>>>>> refined
>>>>>> further for common driver use.
>>>>>>
>>>>>>>
>>>>>>> Reusing drm_gpusvm_range for the XNACK-off case would simplify our
>>>>>>> implementation considerably: it already provides large-page chunk
>>>>>>> optimization and can reuse the existing migration infrastructure.
>>>>>>>
>>>>>>> Building this on top of a standalone drm_gpusvm_pages would mean
>>>>>>> reimplementing much of what the range layer already offers. Reusing
>>>>>>> the range would also let us keep a single code path for both XNACK
>>>>>>> modes, which reduces maintenance burden and avoids behavioral
>>>>>>> differences.
>>>>>>>
>>>>>>> Would this direction be acceptable, or do you see concerns with
>>>>>>> reusing
>>>>>>> the range infrastructure for the no-fault case?
>>>>>>>
>>>>>>
>>>>>> If you prefer something like "insert a range exactly here" + create
>>>>>> range + notifier, I think that's a completely reasonable direction,
>>>>>> and Xe would likely switch over to using it.
>>>>>>
>>>>>> I guess my only concern is sub-userptr migration. We are trending
>>>>>> towards allowing userptrs to be migrated either via prefetch IOCTLs
>>>>>> or access counters on the GPU side - with access counters we'd
>>>>>> likely migrate a single 2M page at a time within the userptr.
>>>>>> get_pages() supports mixed mappings between VRAM + system, but it
>>>>>> likely needs some more work to really make this complete.
>>>>>>
>>>>>> Matt
>>>>>>> Regards,
>>>>>>> Honglei
>>>> ...
>>>>
>>>>
>>