From: Matthew Brost <matthew.brost@intel.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: <intel-xe@lists.freedesktop.org>,
<dri-devel@lists.freedesktop.org>,
<himal.prasad.ghimiray@intel.com>, <airlied@gmail.com>,
<thomas.hellstrom@linux.intel.com>, <simona.vetter@ffwll.ch>,
<felix.kuehling@amd.com>, <dakr@kernel.org>
Subject: Re: [PATCH v6 32/32] drm/doc: gpusvm: Add GPU SVM documentation
Date: Fri, 28 Feb 2025 16:35:42 -0800 [thread overview]
Message-ID: <Z8JWXk9GRlGciaDk@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <7vzwp4ouidxfoyu2qehkezy56tt52e4vjbthr6cc4ezr5iaw7b@plxyy7hf46dm>
On Fri, Feb 28, 2025 at 04:53:11PM +1100, Alistair Popple wrote:
> On Thu, Feb 27, 2025 at 08:36:35PM -0800, Matthew Brost wrote:
> > On Fri, Feb 28, 2025 at 01:34:42PM +1100, Alistair Popple wrote:
> > > On Mon, Feb 24, 2025 at 08:43:11PM -0800, Matthew Brost wrote:
> > > > Add documentation for agreed upon GPU SVM design principles, current
> > > > status, and future plans.
> > >
> > > Thanks for writing this up. In general I didn't see anything too controversial
> > > but added a couple of comments below.
> > >
> > > >
> > > > v4:
> > > > - Address Thomas's feedback
> > > > v5:
> > > > - s/Current/Baseline (Thomas)
> > > >
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > ---
> > > > Documentation/gpu/rfc/gpusvm.rst | 84 ++++++++++++++++++++++++++++++++
> > > > Documentation/gpu/rfc/index.rst | 4 ++
> > > > 2 files changed, 88 insertions(+)
> > > > create mode 100644 Documentation/gpu/rfc/gpusvm.rst
> > > >
> > > > diff --git a/Documentation/gpu/rfc/gpusvm.rst b/Documentation/gpu/rfc/gpusvm.rst
> > > > new file mode 100644
> > > > index 000000000000..063412160685
> > > > --- /dev/null
> > > > +++ b/Documentation/gpu/rfc/gpusvm.rst
> > > > @@ -0,0 +1,84 @@
> > > > +===============
> > > > +GPU SVM Section
> > > > +===============
> > > > +
> > > > +Agreed upon design principles
> > > > +=============================
> > >
> > > As a general comment I think it would be nice if we could add some rationale/
> > > reasons for these design principles. Things inevitably change and if/when
> > > we need to violate or update these principles it would be good to have some
> > > documented rationale for why we decided on them in the first place because the
> > > reasoning may have become invalid by then.
> > >
> >
> > Let me try to add something to the various cases.
>
> Thanks!
>
> > > > +* migrate_to_ram path
> > > > + * Rely only on core MM concepts (migration PTEs, page references, and
> > > > + page locking).
> > > > + * No driver specific locks other than locks for hardware interaction in
> > > > + this path. These are not required, and it is generally a bad idea to
> > > > + invent driver-defined locks to seal core MM races.
> > >
> > > In principle I agree. The problem I think you will run into is the analogue of
> > > what adding a trylock_page() to do_swap_page() fixes. Which is that a concurrent
> > > GPU fault (which is highly likely after handling a CPU fault due to the GPU PTEs
> > > becoming invalid) may, depending on your design, kick off a migration of the
> > > page to the GPU via migrate_vma_setup().
> > >
> > > The problem with that is migrate_vma_setup() will temporarily raise the folio
> > > refcount, which can cause the migrate_to_ram() callback to fail, but the elevated
> > > refcount from migrate_to_ram() can also cause the GPU migration to fail, thus
> > > leading to a live-lock when both CPU and GPU fault handlers just keep retrying.
> > >
> > > This was particularly problematic for us on multi-GPU setups, and our solution
> > > was to introduce a migration critical section in the form of a mutex to ensure
> > > only one thread was calling migrate_vma_setup() at a time.
> > >
> > > And now that I've looked at UVM development history, and remembered more
> > > context, this is why I had a vague recollection that adding a migration entry
> > > in do_swap_page() would be better than taking a page lock. Doing so fixes the
> > > issue with concurrent GPU faults blocking migrate_to_ram() because it makes
> > > migrate_vma_setup() ignore the page.
> > >
> >
> > Ok, this is something to keep an eye on. In the current Xe code, we try
> > to migrate a chunk of memory from the CPU to the GPU in our GPU fault
> > handler once per fault. If it fails due to racing CPU access, we simply
> > leave it in CPU memory and move on. We don't have any real migration
> > policies in Xe yet; that is being worked on as a follow-up to my series.
> > However, if we had a policy requiring that a memory region must be in
> > GPU memory, this could conceivably lead to a livelock with concurrent
> > CPU and GPU access. I'm still not fully convinced that a driver-side
> > lock is the solution here, but without encountering the issue on our
> > side, I can't be completely certain what the solution is.
>
> Right - we have migration policies that can cause us to try harder to migrate.
> Also I agree with you that a driver-side lock might not be the best solution
> here. It's what we did due to various limiations we have, but they are
> unimportant for this discussion.
>
> I agree the ideal solution wouldn't involve locks and would instead be to fix
> the migration interfaces up such that one thread attempting to migrate doesn't
> cause another thread which has started a migration to fail. The solution to that
> isn't obvious, but I don't think it would be impossible either.
>
I agree this would be a good possible solution. Will keep this in mind
when we start implementing on the Xe side.
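
To make the failure mode a bit more concrete, the Xe fault path today has
roughly the shape sketched below (a rough sketch only, not the actual Xe
code; gpu_vm, gpu_alloc_vram_pages() and gpu_copy_to_vram() are made-up
helpers, and migrate_lock marks where a UVM-style serialization mutex
would sit if we ever needed one):

  /*
   * Sketch: best-effort SRAM -> VRAM migration from the GPU fault handler.
   * Helper names and the gpu_vm struct are hypothetical; error unwinding
   * of a failed copy is elided for brevity.
   */
  static int try_migrate_to_vram(struct gpu_vm *gvm, struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end)
  {
          unsigned long npages = (end - start) >> PAGE_SHIFT;
          struct migrate_vma migrate = {
                  .vma = vma,
                  .start = start,
                  .end = end,
                  .pgmap_owner = gvm->device,
                  .flags = MIGRATE_VMA_SELECT_SYSTEM,
          };
          unsigned long *buf;
          int err;

          buf = kvcalloc(npages, 2 * sizeof(*buf), GFP_KERNEL);
          if (!buf)
                  return -ENOMEM;
          migrate.src = buf;
          migrate.dst = buf + npages;

          /* Optional UVM-style critical section discussed above. */
          mutex_lock(&gvm->migrate_lock);

          err = migrate_vma_setup(&migrate);
          if (err || !migrate.cpages)
                  goto out;       /* Best effort: leave the pages in SRAM. */

          /* Allocate VRAM backing, fill migrate.dst, and copy the data. */
          err = gpu_alloc_vram_pages(gvm, &migrate);
          if (!err)
                  err = gpu_copy_to_vram(gvm, &migrate);

          /* Pages whose migrate.dst entry is still 0 simply stay in SRAM. */
          migrate_vma_pages(&migrate);
          migrate_vma_finalize(&migrate);
  out:
          mutex_unlock(&gvm->migrate_lock);
          kvfree(buf);
          return err;
  }

The key property today is that a setup failure (e.g. a raced refcount)
just falls through and leaves the pages in system memory rather than
retrying forever, which is why we have not hit the live-lock yet.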
> > > > + * Partial migration is supported (i.e., a subset of pages attempting to
> > > > + migrate can actually migrate, with only the faulting page guaranteed
> > > > + to migrate).
> > > > + * Driver handles mixed migrations via retry loops rather than locking.
> > > >
> > > > +* Eviction
> > >
> > > This is a term that seems to be somewhat overloaded depending on context, so a
> > > definition would be nice. Is your view of eviction migrating data from GPU back
> > > to CPU without a virtual address to free up GPU memory? (that's what I think of,
> > > but would be good to make sure we're in sync).
> > >
> >
> > Yes. When GPU memory is oversubscribed, we find the physical backing in
> > an LRU list to evict. In Xe, this is a TTM BO.
>
> Sounds good. So eviction is just migration of physical memory.
>
> > > > + * Only looking at physical memory data structures and locks as opposed to
> > > > + looking at virtual memory data structures and locks.
>
> Except of course whatever virtual memory data structures the core-MM needs to
> touch in order to do the migration right? Agree that the driver shouldn't be
> touching any driver data structures that concern themselves with virtual memory
> addresses though. Except what about any data structures that are required as
> part of GPU PTE/TLB invalidation?
>
The eviction is triggered via the new function I added in this series,
migrate_device_pfns(). This triggers the notifier, which invalidates the
GPU virtual address page tables, so the eviction code itself only looks
at physical memory. I can add something here to make this a bit clearer.
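
Roughly, the flow looks like the sketch below (not the Xe code; gpu_bo,
gpu_bo_device_pfns(), alloc_system_pages() and gpu_copy_to_sram() are
made-up helpers, and migrate_device_pfns() is the function this series
adds, so treat the exact signature as illustrative):

  /* Sketch: evict a VRAM allocation using only physical pfns. */
  static int evict_bo_to_sram(struct gpu_bo *bo)
  {
          unsigned long npages = bo->size >> PAGE_SHIFT;
          unsigned long *src, *dst;
          int err;

          src = kvcalloc(npages, 2 * sizeof(*src), GFP_KERNEL);
          if (!src)
                  return -ENOMEM;
          dst = src + npages;

          gpu_bo_device_pfns(bo, src, npages);    /* physical VRAM pfns */

          /*
           * Lock and isolate the device pages; unmapping them fires the
           * mmu notifiers, which is what zaps the GPU virtual mappings.
           */
          err = migrate_device_pfns(src, npages);
          if (err)
                  goto out;

          err = alloc_system_pages(dst, npages);  /* fills dst with migrate pfns */
          if (!err)
                  err = gpu_copy_to_sram(bo, src, dst, npages);

          migrate_device_pages(src, dst, npages);
          migrate_device_finalize(src, dst, npages);
  out:
          kvfree(src);
          return err;
  }

Note nothing above looks at an mm, a vma, or a virtual address; the GPU
page-table invalidation happens as a side effect of the notifier.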
> > > > + * No looking at mm/vma structs or relying on those being locked.
> > >
> > > Agree with the above points.
> > >
> > > > +* GPU fault side
> > > > + * mmap_read lock only used around core MM functions which require this
> > > > + lock, striving to take the mmap_read lock only in the GPU SVM layer.
> > > > + * Big retry loop to handle all races with the mmu notifier under the gpu
> > > > + pagetable locks/mmu notifier range lock/whatever we end up calling
> > > > + those.
> > >
> > > Again, one of the issues here (particularly with multi-GPU setups) is that it's
> > > very easy to live-lock with retry loops because even attempting a migration that
> > > fails can cause migration/fault handling in other threads to fail, either by
> > > calling mmu_notifiers or taking a page reference.
> > >
> > > Those are probably things that we should fix on the MM side, but for now UVM at
> > > least uses a lock to ensure forward progress.
> > >
> >
> > Again, see above. Right now, migration in Xe is best effort rather than
> > a mandatory process, and perhaps this is masking an issue.
> >
> > Maybe I should add a comment here stating your possible concerns and that
> > Xe will be implementing real migration policies and multi-GPU support
> > soon. If this issue arises, we can revisit the locking guidelines or
> > perhaps help contribute to the necessary core changes to make this work
> > properly.
>
> Yeah, that could be good. Something along the lines of core-MM code may need
> fixing in the way I described above (one thread attempting a migration shouldn't
> cause another thread that's already started one to fail).
>
+1
Matt
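
For what it's worth, the fault-side retry loop I have in mind is the
standard notifier seqno pattern, roughly (a sketch, not the Xe code;
gpu_bind_pages() and the gpu_vm/gpu_range structs are made up):

  /* Sketch: GPU fault handling with a big retry loop, no extra locks. */
  static int handle_gpu_fault(struct gpu_vm *gvm, struct gpu_range *range)
  {
          struct hmm_range hmm = {
                  .notifier = &range->notifier,
                  .start = range->start,
                  .end = range->end,
                  .hmm_pfns = range->pfns,
                  .default_flags = HMM_PFN_REQ_FAULT,
                  .dev_private_owner = gvm->device,
          };
          int err;

  again:
          hmm.notifier_seq = mmu_interval_read_begin(&range->notifier);

          mmap_read_lock(gvm->mm);        /* only around core MM calls */
          err = hmm_range_fault(&hmm);
          mmap_read_unlock(gvm->mm);
          if (err == -EBUSY)
                  goto again;
          if (err)
                  return err;

          mutex_lock(&gvm->notifier_lock);        /* GPU pagetable lock */
          if (mmu_interval_read_retry(&range->notifier, hmm.notifier_seq)) {
                  mutex_unlock(&gvm->notifier_lock);
                  goto again;     /* raced with an invalidation, retry */
          }
          err = gpu_bind_pages(gvm, range);       /* program GPU PTEs */
          mutex_unlock(&gvm->notifier_lock);

          return err;
  }

Any eviction or migrate_to_ram() that races with the fault just bumps the
notifier seqno and the loop goes around again, which is the behaviour the
doc is trying to capture.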
> > > > + * Races (especially against concurrent eviction or migrate_to_ram)
> > > > + should not be handled on the fault side by trying to hold locks;
> > > > + rather, they should be handled using retry loops. One possible
> > > > + exception is holding a BO's dma-resv lock during the initial migration
> > > > + to VRAM, as this is a well-defined lock that can be taken underneath
> > > > + the mmap_read lock.
> > >
> > > See my earlier comments. Although note I agree with this in principle, and we do
> > > just retry if taking the lock fails.
> > >
> > > > +* Physical memory to virtual backpointer
> > > > + * Does not work, no pointers from physical memory to virtual should
> > > > + exist.
> > >
> > > Agree. And my rationale is that core-MM can update the virtual address for a
> > > page without notifying a driver of the new address. For example with mremap().
> > > So it's impossible to keep any backpointer to a virtual address up to date.
> > >
> >
> > Yep, this is a good example and I will include it in my next rev.
> >
> > > > + * Physical memory backpointer (page->zone_device_data) should be stable
> > > > + from allocation to page free.
> > >
> > > Agree. And presumably the rationale is that it is very difficult to safely
> > > update page->zone_device_data and ensure there aren't concurrent users of it
> > > unless the page is free (ie. has a 0 refcount)?
> > >
> >
> > Yes, exactly.
> >
> > > > +* GPU pagetable locking
> > > > + * Notifier lock only protects range tree, pages valid state for a range
> > > > + (rather than seqno due to wider notifiers), pagetable entries, and
> > > > + mmu notifier seqno tracking; it is not a global lock to protect
> > > > + against races.
> > > > + * All races handled with big retry as mentioned above.
> > >
> > > Sounds reasonable.
> > >
> > > > +Overview of current design
> > > > +==========================
> > > > +
> > > > +The baseline design is as simple as possible to get a working baseline which
> > > > +can be built upon.
> > > > +
> > > > +.. kernel-doc:: drivers/gpu/drm/xe/drm_gpusvm.c
> > > > + :doc: Overview
> > > > + :doc: Locking
> > > > + :doc: Migration
> > > > + :doc: Partial Unmapping of Ranges
> > > > + :doc: Examples
> > > > +
> > > > +Possible future design features
> > > > +===============================
> > > > +
> > > > +* Concurrent GPU faults
> > > > + * CPU faults are concurrent so it makes sense to have concurrent GPU
> > > > + faults.
> > > > + * Should be possible with fine-grained locking in the driver GPU
> > > > + fault handler.
> > > > + * No expected GPU SVM changes required.
> > > > +* Ranges with mixed system and device pages
> > > > + * Can be added to drm_gpusvm_get_pages fairly easily if required.
> > >
> > > I don't see much of a need, but also don't see any barriers on the MM side for
> > > doing that.
> > >
> >
> > I don't see any barriers either; I think it would work in Xe with a slight
> > tweak to our VM bind code.
> >
> > I'm unsure of the use case, though.
> >
> > > > +* Multi-GPU support
> > > > + * Work in progress and patches expected after GPU SVM initially
> > > > + lands.
> > > > + * Ideally can be done with little to no changes to GPU SVM.
> > >
> > > See above, but I mostly agree.
> > >
> > > > +* Drop ranges in favor of radix tree
> > > > + * May be desirable for faster notifiers.
> > > > +* Compound device pages
> > > > + * Nvidia, AMD, and Intel have all agreed that expensive core MM functions
> > > > + in the migrate device layer are a performance bottleneck; having
> > > > + compound device pages should help increase performance by reducing the
> > > > + number of these expensive calls.
> > >
> > > I'm hoping my patch series[1] to allow for compound device pages will land in v6.15
> >
> > Cool! I was not aware of this ongoing series. Let me look.
>
> There's probably not much of direct interest to you there at the moment. It's a
> prerequisite in that it allows current (and therefore future) users of compound
> ZONE_DEVICE pages to have them ref-counted normally, instead of the funky scheme
> DAX uses at the moment. Our changes will build on top of that.
>
> > > as well. The next steps are extending that to DEVICE_PRIVATE pages with
> > > migrate_vma_setup() and migrate_to_ram() and we have been making some good
> > > progress on this internally. I hope to have something posted before LSFMM.
> > >
> >
> > Also cool. If you think you have something working, let me know and I will
> > pull in the changes to test them out.
>
> Will do!
>
> > > The other thing we have been looking at here is being able to migrate
> > > file-backed pages to GPU memory.
> >
> > I can also test this one out.
>
> Thanks.
>
> - Alistair
>
> > Matt
> >
> > >
> > > [1] - https://lore.kernel.org/linux-mm/cover.a782e309b1328f961da88abddbbc48e5b4579021.1739850794.git-series.apopple@nvidia.com/
> > >
> > > > +* Higher order dma mapping for migration
> > > > + * 4k dma mapping adversely affects migration performance on Intel
> > > > + hardware; higher order (2M) dma mapping should help here.
> > > > +* Build common userptr implementation on top of GPU SVM
> > > > diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
> > > > index 476719771eef..396e535377fb 100644
> > > > --- a/Documentation/gpu/rfc/index.rst
> > > > +++ b/Documentation/gpu/rfc/index.rst
> > > > @@ -16,6 +16,10 @@ host such documentation:
> > > > * Once the code has landed move all the documentation to the right places in
> > > > the main core, helper or driver sections.
> > > >
> > > > +.. toctree::
> > > > +
> > > > + gpusvm.rst
> > > > +
> > > > .. toctree::
> > > >
> > > > i915_gem_lmem.rst
> > > > --
> > > > 2.34.1
> > > >