From: Matthew Brost <matthew.brost@intel.com>
To: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>,
"Christian König" <christian.koenig@amd.com>,
intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
airlied@gmail.com, matthew.auld@intel.com, daniel@ffwll.ch,
"Paneer Selvam, Arunpravin" <Arunpravin.PaneerSelvam@amd.com>
Subject: Re: [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration
Date: Thu, 29 Aug 2024 22:12:53 +0000 [thread overview]
Message-ID: <ZtDyZceN6asCzSHT@DUT025-TGLU.fm.intel.com> (raw)
In-Reply-To: <ZtBVXjNf1xAQOHHR@phenom.ffwll.local>

On Thu, Aug 29, 2024 at 01:02:54PM +0200, Daniel Vetter wrote:
> On Thu, Aug 29, 2024 at 11:53:58AM +0200, Thomas Hellström wrote:
> > But as Sima pointed out in private communication, exhaustive eviction
> > is not really needed for faulting to make (crawling) progress.
> > Watermarks and VRAM trylock shrinking should suffice, since we're
> > strictly only required to service a single gpu page granule at a time.
> >
> > However, ordinary bo-based jobs would still like to be able to
> > completely evict SVM vram. Whether that is important enough to strive
> > for is ofc up for discussion.
>
> My take is that you don't win anything for exhaustive eviction by having
> the dma_resv somewhere in there for svm allocations. Roughly for split lru
> world, where svm ignores bo/dma_resv:
>
> When evicting vram from the ttm side we'll fairly switch between selecting
> bo and throwing out svm pages. With drm_exec/ww_acquire_ctx selecting bo
> will eventually succeed in vacuuming up everything (with a few retries
> perhaps, if we're not yet at the head of the ww ticket queue).
>
> svm pages we need to try to evict anyway - there's no guarantee, because
> the core mm might be holding temporary page references (which block
Yeah, but I think you could kill the app then - not suggesting we
should, but we could. To me this is akin to a CPU fault where we are
unable to migrate the device pages - the migration layer docs say that
when this happens, kick it to user space and segfault the app.
My last patch in the series adds some asserts to see if this ever
happens; thus far it never has. If it occurs we could handle it
gracefully by aborting the migration, I guess... I think the user
really needs to do something a bit crazy to trigger this condition - I
don't think the core mm randomly grabs refs to device pages, but I
could be wrong.
> migration) or have the page locked (which also block the migration). But
> as long as those two steps succeed, we'll win and get the pages. There
> might be some thrashing against concurrent svm faults stealing them again,
> but they have a disadvantage since they can't steal dma_resv_locked bo.
> And if it's still too much we can stall them in the page allocator.
>
> So it's not entirely reliable, but should be close enough.
>
> Now for bo based svm the picture isn't any different, because holding
> dma_resv is not actually enough to migrate svm mappings. We still need to
> hope there's no temporary page references around, and we still need to
> succeed at locking the page. And the migration code only does trylocks,
> because that's its deadlock prevention algorithm when different migrations
> need the same set of pages but acquire them in a different order. So
> we win nothing.
Ok, maybe my statement above is false...
Wouldn't the only time this fails be if another migration is in flight
(e.g. a CPU fault) and they race? Then the eviction will naturally
happen via the refcount being dropped by the other migration. I guess I
likely need to update my eviction code to not free the TTM resource
unless all pages have been migrated.
>
> Worse, if dma_resv does actually hold up svm migration and reclaim, then
> we potentially deadlock because that lock is for a bigger range than
> individual pages (or folios). And the core mm assumes that it can get out
> of a deadlock bind by (at least stochastically) eventually succeeding in
> acquiring/locking down a single page.
>
> This means we cannot use dma_resv tricks to give the ttm world an
> advantage in exhaustive eviction against concurrent svm faults. Or at
> least not more than we can do without by just stalling svm faults that
> need to allocate gpu memory (but that must happen without holding locks or
> we're busted).
>
I'm a little lost here on the deadlock case. Do you mean that when we
try to evict an SVM BO we trigger reclaim by allocating system pages
and can deadlock? Doesn't TTM already have this dependency when
evicting non-SVM BOs?
> So the only benefit I'm seeing is the unified lru, which I'm not sure is
> worth it. There's also a bit a lru design tension here, because for the bo
Well also not rewriting the world...
Matt
> world we want objects that are locked to stay on the lru, so that the
> competing processes can figure out who has the winning ww ticket. The core
> mm design otoh does isolate pages and remove them from the lru when
> they're acquired, so that they don't gunk up other processes from trying
> to make forward progress and are better hidden. Which reduces temporary
> page references (from lru walk) preventing migration and stuff like that.
>
> Cheers, Sima
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch