dri-devel Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: "Huang, Honglei1" <honghuan@amd.com>
Cc: "Felix Kuehling" <felix.kuehling@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	Alexander.Deucher@amd.com,
	"Honglei Huang" <honglei1.huang@amd.com>,
	Oak.Zeng@amd.com, Jenny-Jing.Liu@amd.com, Philip.Yang@amd.com,
	Xiaogang.Chen@amd.com, Ray.Huang@amd.com, Lingshan.Zhu@amd.com,
	Junhua.Shen@amd.com,
	"Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
	"Rodrigo Vivi" <rodrigo.vivi@intel.com>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"Alice Ryhl" <aliceryhl@google.com>
Subject: Re: [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm
Date: Wed, 29 Apr 2026 20:04:35 -0700	[thread overview]
Message-ID: <afLGw7CTYgCVvVIW@gsse-cloud1.jf.intel.com> (raw)
In-Reply-To: <9b4e3f7c-5d32-4b7b-a270-703325926f86@amd.com>

On Tue, Apr 28, 2026 at 10:23:18AM +0800, Huang, Honglei1 wrote:
> 
> 
> On 4/28/2026 4:00 AM, Felix Kuehling wrote:
> > 
> > On 2026-04-24 06:43, Huang, Honglei1 wrote:
> > > 
> > > 
> > > On 4/24/2026 4:21 AM, Matthew Brost wrote:
> > > > On Thu, Apr 23, 2026 at 07:03:52PM +0800, Huang, Honglei1 wrote:
> > > > > 
> > > > > 
> > > > > On 4/23/2026 3:18 PM, Matthew Brost wrote:
> > > > > ...
> > > > > > > > > > > > This clarifies a lot. This is what we’d call in Xe
> > > > > > > > > > > > “preemption fence” mode for a VM. Anytime memory is
> > > > > > > > > > > > moved, we trigger a GPU preemption and resume. We
> > > > > > > > > > > > don’t actually support SVM in this case; instead, we
> > > > > > > > > > > > use “userptr binds,” which are built on gpusvm for
> > > > > > > > > > > > page collection. However, we don’t support migrating
> > > > > > > > > > > > memory to the device—though we could.
> > > > > > > > > > > > 
> > > > > > > > > > > > I’d look at how we converted 'userptr' to be based
> > > > > > > > > > > > on GPU SVM [2]. In this case, don’t maintain a range
> > > > > > > > > > > > tree, as those—as you suggest—are more of an
> > > > > > > > > > > > on-demand fault driver concern. Instead, just embed
> > > > > > > > > > > > 'struct drm_gpusvm_pages' in the VMA struct defined
> > > > > > > > > > > > by the IOCTLs.
> > > > > > > > > > > > 
> > > > > > > > > > > > We could extend this to support migrating 'userptr',
> > > > > > > > > > > > but we just haven’t done that yet—this may be what
> > > > > > > > > > > > you want to do in “XNACK off”.
> > > > > > > > > > > > 
> > > > > > > > > > > > [2] https://patchwork.freedesktop.org/series/146553/
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Actually we need to switch the xnack mode between on
> > > > > > > > > > > and off, so in xnack off mode the driver operates in
> > > > > > > > > > > "implicit prefetch mode". This may be due to
> > > > > > > > > > > compatibility with older hardware and the needs of the
> > > > > > > > > > > UMD runtime. We will further discuss the handling
> > > > > > > > > > > method under xnack off internally.
> > > > > > > > > > > 
> > > > > > > > 
> > > > > > > > Hi Matt,
> > > > > > > > 
> > > > > > > > I studied the xe_userptr code and the conversion series [2] you
> > > > > > > > pointed to.
> > > > > > > > 
> > > > > > > > I have a question: would it be possible to reuse
> > > > > > > > drm_gpusvm_range to handle hardware without the gpu fault
> > > > > > > > feature (xnack off mode)?
> > > > > > > 
> > > > > > > That’s not how we’ve done it. We embedded drm_gpusvm_pages
> > > > > > > into our VMA structure and then attached a notifier. The
> > > > > > > notifier attachment is open-coded on the Xe side, and this
> > > > > > > could be normalized and opened up for common driver use cases.
> > > > > 
> > > > > The approach in xe_userptr is like the implementation in kfd_svm:
> > > > > embed the physical pages into a structure and attach a notifier
> > > > > of the same size.
> > > > > But kfd_svm is an implementation of SVM semantics, which supports
> > > > > partial unmap, doesn't need an explicit delete-userptr ioctl call
> > > > > on removal, and doesn't need an explicit userptr flag at creation.
> > > > > And there is actually already an existing implementation of
> > > > > userptr semantics in amdgpu kfd: KFD_IOC_ALLOC_MEM_FLAGS_USERPTR.
> > > > > If the no-gpu-fault mode cannot use the drm_gpusvm framework and
> > > > > we use the same approach as xe_userptr, it seems like duplicating
> > > > > work.
> > > > > 
> > > > > I think the core gap is that we are trying to use drm_gpusvm to
> > > > > implement an SVM-semantics driver for no-gpu-fault hardware, not
> > > > > userptr semantics.
> > > > > 
> > > > > > > 
> > > > > > > The problem with reusing drm_gpusvm_range directly is that a
> > > > > > > VMA may span multiple gpusvm notifiers—i.e., it can be larger
> > > > > > > than the notifier size. Of course, we could rework this as
> > > > > > > well.
> > > > > 
> > > > > So on the "VMA spans multiple gpusvm notifiers" concern: I'd like
> > > > > to clarify that this is not actually a blocker for amdgpu's
> > > > > XNACK-off path, because amdgpu does not try to represent one user
> > > > > ioctl virtual address interval as a single drm_gpusvm_range.
> > > > > 
> > > > > We walk the attr interval and call drm_gpusvm_range_find_or_insert()
> > > > > repeatedly, letting gpusvm pick chunk-aligned ranges bounded by
> > > > > notifier_size. One ioctl interval will create N chunk-sized ranges.
> > > > > 
> > > > > > > 
> > > > > > 
> > > > > > Sorry for the double reply—I just glanced at the latest
> > > > > > series. I don’t
> > > > > > think creating a range per page of the userptr is desirable. While it
> > > > > > would work, from a time-complexity point of view I don’t think this is
> > > > > > ideal.
> > > > > > 
> > > > > > The issue with spans across multiple notifiers is real, though.
> > > > > > 
> > > > > > My rough idea would be:
> > > > > > 
> > > > > > - Give drivers an interface to create larger ranges.
> > > > > 
> > > > > So maybe we do not need to create larger ranges if we call
> > > > > drm_gpusvm_range_find_or_insert() repeatedly.
> > > > > 
> > > > 
> > > > That will be functional, but consider it from a time-complexity point of
> > > > view.
> > > > 
> > > > Multiple ranges increase the time complexity of range-tree searches.
> > > > This isn’t a huge deal, but it will show up to some extent.
> > > > 
> > > > Multiple ranges will also slow down DMA mapping and migration. We
> > > > switched over to the dma_iova_alloc/link/unlink/sync API here [1].
> > > > While dma_iova_link is a relatively fast radix-tree walk, the allocation
> > > > and sync steps are where things get expensive. Therefore, it is
> > > > advantageous to perform these steps as few times as possible. For
> > > > example, if your SVM buffer is 512MB, instead of doing these steps 256
> > > > times, you do them once. The same logic applies to the migrate_vma_*
> > > > functions—they are quite expensive, so doing them in a single shot is
> > > > significantly faster.
> > > > 
> > > > The same applies to invalidations. If you can invalidate a large range
> > > > in a single shot, it will be faster. Although the logic in the notifier
> > > > should be able to zap multiple ranges in one shot (Xe does this), having
> > > > to DMA-unmap a single large range will still be faster than multiple
> > > > smaller DMA unmaps.
> > > > 
> > > > The TL;DR is if your driver knows the size of the SVM allocation
> > > > upfront (e.g., an IOCTL tells you the size) it makes more sense to
> > > > use a single large struct (either embed drm_gpusvm_pages into a VMA
> > > > or we figure out an interface to insert large ranges / notifiers).
> > > > 
> > > > [1] https://patchwork.freedesktop.org/series/160587/
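> > > > 
> > > > For the embed-into-a-VMA option, the shape is roughly the following
> > > > (a sketch with illustrative field names, not Xe's actual layout):
> > > > 
> > > >   struct driver_userptr_vma {
> > > >           struct drm_gpusvm_pages pages;          /* page collection + DMA state */
> > > >           struct mmu_interval_notifier notifier;  /* one notifier spanning the VMA */
> > > >           u64 start, end;                         /* VA span known at IOCTL time */
> > > >   };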
> > > > 
> > > > > > 
> > > > > > - If the range fits inside a single notifier’s size → done.
> > > > > > 
> > > > > > - If the range spans multiple notifier sizes → round up to a
> > > > > >     power of two and create a larger notifier. This may overlap
> > > > > >     with existing notifiers, which is likely fine given that
> > > > > >     interval trees support overlaps (?). We’d need to
> > > > > >     double-check and test this. If overlapping notifiers are not
> > > > > >     acceptable, we’d need some heavy-handed notifier merge
> > > > > >     logic—it will be complicated, but isolated, so once we get
> > > > > >     it right everyone can use it.
> > > > > 
> > > > > If we call drm_gpusvm_range_find_or_insert() repeatedly,
> > > > > drm_gpusvm will create the corresponding notifier correctly as
> > > > > far as I can see.
> > > > > 
> > > > 
> > > > I agree this will be functional but not ideal. You can always start the
> > > > approach you have here and optimize it later by adding the required
> > > > support in GPU SVM.
> > > > 
> > > 
> > > Hi Matt,
> > > 
> > > Really thanks for your information, this really helps a lot!
> > > 
> > > 
> > > Hi Christian, Felix,
> > > 
> > > According to the discussion with Matt on the previous thread, I'd
> > > like to align with you on the XNACK off direction before starting
> > > the series.
> > > 
> > > According to the information from Matt: when the allocation size is
> > > known, doing one big operation is significantly faster than doing
> > > many small ranges, because the allocation and sync steps are where
> > > things get expensive. Doing them in a single shot helps especially
> > > in the xnack off case, which needs pre-fault and pre-map in the
> > > ioctl, and where the size is known.
> > > 
> > > It was confirmed that repeatedly calling
> > > drm_gpusvm_range_find_or_insert() is functional, with the suggestion
> > > to land that first and optimize later by adding large-range support
> > > in the GPU SVM core. That motivates the two-phase plan below.
> > > 
> > > Phase 1:
> > > - Reuse drm_gpusvm_range for XNACK-off: one ioctl interval is split
> > >   by drm_gpusvm_range_find_or_insert() into N chunk-sized ranges
> > >   bounded by notifier_size, the same mechanism as the fault path.
> > > - Populate all ranges at ioctl / submit time instead of on fault.
> > > - Invalidation -> GPU queue stop -> rebind/restore the pages and GPU
> > >   mapping -> restore queue.
> > > 
> > > Phase 2:
> > > Add a large range / large notifier insert interface in the GPU SVM
> > > core so one ioctl interval maps to a single range, to improve
> > > efficiency. This requires modifying the drm_gpusvm framework.
> > > 
> > > May I know your thoughts on this plan?
> > 
> > I think drm_gpusvm_range_find_or_insert already has all the parameters
> > necessary to allocate larger notifiers and ranges. All it would take is
> > maybe adding a flag in drm_gpusvm_ctx to request larger range allocation
> > instead of arbitrary chunking.

Yes, I agree this is a completely reasonable direction. Something like
overriding 'chunks' with a direct placement + size and then figuring out
the notifier install algorithm in the gpusvm layer - again, this only gets
tricky if a direct placement spans multiple notifiers.
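
Roughly, something like the following (a sketch only - neither the flag
nor the behavior exists in drm_gpusvm today):

  struct drm_gpusvm_ctx ctx = {
          .exact_range = true,    /* hypothetical: skip chunking and insert
                                   * one range covering [start, end) */
  };

  range = drm_gpusvm_range_find_or_insert(gpusvm, start, start, end, &ctx);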

Matt

> > 
> > I agree this could be done as a second phase and is mostly work in the
> > drm_gpusvm code.
> 
> 
> Thanks very much for the reply; I will implement the large range feature
> according to your suggestion.
> 
> Regards,
> Honglei
> 
> > 
> > Regards,
> >    Felix
> > 
> > 
> > > 
> > > Regards,
> > > Honglei
> > > 
> > > 
> > > > Matt
> > > > 
> > > > > Regards,
> > > > > Honglei
> > > > > 
> > > > > > 
> > > > > > - Finally, make sure that individual userptr pages can reside at any
> > > > > >     location.
> > > > > > 
> > > > > > Or conversely:
> > > > > > 
> > > > > > - Normalize embedding of drm_gpusvm_pages in VMA structs + notifier
> > > > > >     creation
> > > > > > 
> > > > > > - Make sure that individual userptr pages can reside at any location.
> > > > > 
> > > > > > 
> > > > > > Both options actually sound really similar after typing this out.
> > > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > So either way, the Xe userptr + gpusvm
> > > > > > > implementation should be refined
> > > > > > > further for common driver use.
> > > > > > > 
> > > > > > > > 
> > > > > > > > Reusing drm_gpusvm_range for the XNACK-off case would
> > > > > > > > simplify our implementation considerably: it already
> > > > > > > > provides the large page chunk optimization and can reuse
> > > > > > > > the existing migration infrastructure.
> > > > > > > > 
> > > > > > > > Building this on top of a standalone drm_gpusvm_pages would
> > > > > > > > mean reimplementing much of what the range layer already
> > > > > > > > offers. Reusing the range layer would also let us keep a
> > > > > > > > single code path for both XNACK modes, which reduces the
> > > > > > > > maintenance burden and avoids behavioral differences.
> > > > > > > > 
> > > > > > > > Would this direction be acceptable, or do you
> > > > > > > > see concerns with reusing
> > > > > > > > the range infrastructure for the no-fault case?
> > > > > > > > 
> > > > > > > 
> > > > > > > If you prefer something like "insert a range exactly here" +
> > > > > > > create range + notifier, I think that's a completely
> > > > > > > reasonable direction and Xe would likely switch over to using
> > > > > > > this.
> > > > > > > 
> > > > > > > I guess my only concern is sub-userptr migration. We are
> > > > > > > trending towards allowing userptrs to be migrated either via
> > > > > > > prefetch IOCTLs or access counters on the GPU side - with
> > > > > > > access counters we'd likely migrate a single 2M page at a
> > > > > > > time within the userptr. get_pages() supports mixed mappings
> > > > > > > between VRAM + system but likely needs some more work to
> > > > > > > really make this complete, though.
> > > > > > > 
> > > > > > > Matt
> > > > > > > > Regards,
> > > > > > > > Honglei
> > > > > ...
> > > > > 
> > > > > 
> > > 
> 

Thread overview: 36+ messages
2026-03-17 11:29 [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 01/12] drm/amdgpu: add SVM UAPI definitions Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 02/12] drm/amdgpu: add SVM data structures and header Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 03/12] drm/amdgpu: add SVM attribute data structures Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 04/12] drm/amdgpu: implement SVM attribute tree operations Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 05/12] drm/amdgpu: implement SVM attribute set Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 06/12] drm/amdgpu: add SVM range data structures Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 07/12] drm/amdgpu: implement SVM range PTE flags and GPU mapping Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 08/12] drm/amdgpu: implement SVM range notifier and invalidation Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 09/12] drm/amdgpu: implement SVM range workers Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 10/12] drm/amdgpu: implement SVM core initialization and fini Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 11/12] drm/amdgpu: implement SVM ioctl and fault handler Honglei Huang
2026-03-17 11:29 ` [RFC/POC PATCH 12/12] drm/amdgpu: wire up SVM build system " Honglei Huang
2026-03-17 11:48 ` [RFC/POC PATCH 00/12] POC SVM implementation in AMDGPU based on drm_gpusvm Christian König
2026-03-18  8:59   ` Honglei Huang
2026-03-19  5:08     ` Matthew Brost
2026-03-19 14:17       ` Honglei Huang
2026-03-23  6:31         ` Matthew Brost
2026-03-24  7:24           ` Honglei Huang
2026-03-25 22:24             ` Matthew Brost
2026-03-26 12:16               ` Honglei Huang
2026-04-15 10:04                 ` Huang, Honglei1
2026-04-23  6:40                   ` Matthew Brost
2026-04-23  7:18                     ` Matthew Brost
2026-04-23 11:03                       ` Huang, Honglei1
2026-04-23 20:21                         ` Matthew Brost
2026-04-24 10:43                           ` Huang, Honglei1
2026-04-27 20:00                             ` Felix Kuehling
2026-04-28  2:23                               ` Huang, Honglei1
2026-04-30  3:04                                 ` Matthew Brost [this message]
2026-04-23  6:09           ` Huang, Honglei1
2026-04-23  6:52             ` Matthew Brost
2026-04-23  8:22               ` Huang, Honglei1
2026-04-29  9:56                 ` Huang, Honglei1
2026-04-30  2:56                   ` Huang, Honglei1
2026-04-30  3:12                     ` Matthew Brost
