Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Daniel Vetter <daniel.vetter@ffwll.ch>
To: Matthew Brost <matthew.brost@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>,
	intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	airlied@gmail.com, christian.koenig@amd.com,
	thomas.hellstrom@linux.intel.com, matthew.auld@intel.com,
	daniel@ffwll.ch
Subject: Re: [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory
Date: Mon, 2 Sep 2024 13:53:14 +0200	[thread overview]
Message-ID: <ZtWnKvaV964EA6mq@phenom.ffwll.local> (raw)
In-Reply-To: <ZtCvcVu3SZsManOw@DUT025-TGLU.fm.intel.com>

On Thu, Aug 29, 2024 at 05:27:13PM +0000, Matthew Brost wrote:
> On Thu, Aug 29, 2024 at 11:45:08AM +0200, Daniel Vetter wrote:
> > On Tue, Aug 27, 2024 at 07:48:38PM -0700, Matthew Brost wrote:
> > > This patch introduces support for GPU Shared Virtual Memory (SVM) in the
> > > Direct Rendering Manager (DRM) subsystem. SVM allows for seamless
> > > sharing of memory between the CPU and GPU, enhancing performance and
> > > flexibility in GPU computing tasks.
> > > 
> > > The patch adds the necessary infrastructure for SVM, including data
> > > structures and functions for managing SVM ranges and notifiers. It also
> > > provides mechanisms for allocating, deallocating, and migrating memory
> > > regions between system RAM and GPU VRAM.
> > > 
> > > This mid-layer is largely inspired by GPUVM.
> > > 
> > > Cc: Dave Airlie <airlied@redhat.com>
> > > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>
> > > Cc: <dri-devel@lists.freedesktop.org>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > 
> > Still not sure I've got the right race that you paper over with
> > mmap_write_lock, but I spotted a few things, commments inline.
> > 
> 
> I've replied to this issue several times, let's table the
> mmap_write_lock issue in this reply - a lot of other things to get
> through. Current thinking is try to add a range->migrate_lock like AMD
> which I state here [1]. Let's continue discussing the mmap lock issue
> there if possible.

Yeah I wrote replies as I read code, so there's a bit a mess from my side
here. Apologies for that.

> [1] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169

Some more replies below that I think we haven't covered anywhere else yet.

> > > + * 2) Garbage Collector.
> > > + *
> > > + *	void __driver_garbage_collector(struct drm_gpusvm *gpusvm,
> > > + *					struct drm_gpusvm_range *range)
> > > + *	{
> > > + *		struct drm_gpusvm_ctx ctx = {};
> > > + *
> > > + *		assert_driver_svm_locked(gpusvm);
> > > + *
> > > + *		// Partial unmap, migrate any remaining VRAM pages back to SRAM
> > > + *		if (range->flags.partial_unmap)
> > > + *			drm_gpusvm_migrate_to_sram(gpusvm, range, &ctx);
> > 
> > Note that the migration back to sram isn't guaranteed to succeed, so you
> > might be still stuck with partially migrated range. This might be a case
> > where hmm gives you vram pfns, but the range you have doesn't have any
> > vram allocation anymore because you droppped it here. Not sure tbh.
> >
> 
> Hmm isn't the picture here nor will a VMA once the
> drm_gpusvm_evict_to_sram path is always taken as discussed here [2]. I
> might have a corner case BO refcounting / TTM resource lookup bug in
> somewhere in here which needs to be resolved though (e.g. eviction
> racing with this code path), will try to close on that.
> 
> [2] https://patchwork.freedesktop.org/patch/610955/?series=137870&rev=1#comment_1111164

So maybe my understanding is wrong, but from my reading of the device
migration code the exact same non-guarantees as for the sram2sram
migration code apply:

- There's no guarantee the page/folio doesn't have an elevated refcount,
  which makes the migration fail (in try_to_migrate, where it checks for
  surplus refcounts).

- There's no guarantee you'll get the page/folio lock, which makes the
  migration fail. Worse the core mm seems to use a fallback to per-page
  locking as it's extremely crude "get out of deadlocks due to acquiring
  multiple page locks" card.

> > > +map_pages:
> > > +	if (is_device_private_page(hmm_pfn_to_page(pfns[0]))) {
> > > +		WARN_ON_ONCE(!range->vram_allocation);
> > > +
> > > +		for (i = 0; i < npages; ++i) {
> > > +			pages[i] = hmm_pfn_to_page(pfns[i]);
> > > +
> > > +			if (WARN_ON_ONCE(!is_device_private_page(pages[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				goto err_free;
> > > +			}
> > > +		}
> > 
> > You can't do the above, because the pfn you get from hmm come with zero
> > guarantees, you neither hold a page reference nor the page lock. The only
> > thing you can do is grab the pagetable lock (or mmu notifier locks) and
> > check it's still valid, before you can touch any state. I think the
> > range->vram_allocation is probably always valid since you clean that up
> > under the same lock/thread, but there's good chances the vram allocation
> > is otherwise already gone for good. Or you get an inconsistent snapshot.
> > 
> 
> I haven't seen this pop in my testing yet which is fairly thorough. My
> thinking was migration always being enforced at range grainularity we'd
> never get mixed mappings from the core as migration is completely under
> control of the driver. Maybe I'm not understanding what you are saying
> here...

So one scenario is that you race (without the mmap write lock or the
migration_mutex design ofc) with another invalidate, and get a partial
view here of mixed vram and sram pages. Until you acquire the mmu notifier
lock and have made sure your pages are still valid, there's essentially no
guarantee.
> 
> > > +
> > > +		/* Do not race with notifier unmapping pages */
> > > +		drm_gpusvm_notifier_lock(gpusvm);
> > > +		range->flags.has_vram_pages = true;
> > > +		range->pages = pages;
> > > +		if (mmu_interval_read_retry(notifier, hmm_range.notifier_seq)) {
> > > +			err = -EAGAIN;
> > > +			__drm_gpusvm_range_unmap_pages(gpusvm, range);
> > > +		}
> > > +		drm_gpusvm_notifier_unlock(gpusvm);
> > > +	} else {
> > > +		dma_addr_t *dma_addr = (dma_addr_t *)pfns;
> > > +
> > > +		for_each_dma_page(i, j, npages, order) {
> > > +			if (WARN_ON_ONCE(i && order !=
> > > +					 hmm_pfn_to_map_order(pfns[i]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +			order = hmm_pfn_to_map_order(pfns[i]);
> > > +
> > > +			pages[j] = hmm_pfn_to_page(pfns[i]);
> > > +			if (WARN_ON_ONCE(is_zone_device_page(pages[j]))) {
> > > +				err = -EOPNOTSUPP;
> > > +				npages = i;
> > > +				goto err_unmap;
> > > +			}
> > > +
> > > +			set_page_dirty_lock(pages[j]);
> > > +			mark_page_accessed(pages[j]);
> > 
> > You can't do these, because you don't hold a page reference. They're also
> > not needed because hmm_range_fault goes thorugh the full mkwrite dance,
> > which takes care of these, unlike the gup family of functions.
> >
> 
> This is a left over from our existing userpte code and it does appear to
> be incorrect. Let me remove this and fixup our userptr code while I'm at
> it.

Ack.

> > > +	vas = vma_lookup(mm, start);
> > > +	if (!vas) {
> > > +		err = -ENOENT;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (end > vas->vm_end || start < vas->vm_start) {
> > > +		err = -EINVAL;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	if (!vma_is_anonymous(vas)) {
> > > +		err = -EBUSY;
> > > +		goto err_mmunlock;
> > > +	}
> > > +
> > > +	buf = kvcalloc(npages, 2 * sizeof(*migrate.src) + sizeof(*dma_addr) +
> > > +		       sizeof(*pages), GFP_KERNEL);
> > > +	if (!buf) {
> > > +		err = -ENOMEM;
> > > +		goto err_mmunlock;
> > > +	}
> > > +	dma_addr = buf + (2 * sizeof(*migrate.src) * npages);
> > > +	pages = buf + (2 * sizeof(*migrate.src) + sizeof(*dma_addr)) * npages;
> > > +
> > > +	zdd = drm_gpusvm_zdd_alloc(range);
> > > +	if (!zdd) {
> > > +		err = -ENOMEM;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	migrate.vma = vas;
> > > +	migrate.src = buf;
> > > +	migrate.dst = migrate.src + npages;
> > > +
> > > +	err = migrate_vma_setup(&migrate);
> > > +	if (err)
> > > +		goto err_free;
> > > +
> > > +	/*
> > > +	 * FIXME: Below cases, !migrate.cpages and migrate.cpages != npages, not
> > > +	 * always an error. Need to revisit possible cases and how to handle. We
> > > +	 * could prefault on migrate.cpages != npages via hmm_range_fault.
> 
> This is a bit stale, can update this comment.
> 
> > > +	 */
> > 
> > Yeah I think especially under contention partial migrations, at least back
> > to sram due to cpu faults, are pretty much expected. And you need to cope
> > somehow.
> > 
> 
> I have seen these pop if the IGT calls mlock on the memory. My thinking
> is migration to VRAM is basically optional and fallback to leaving range
> in SRAM if an error occurs rather than doing a partial migration. This
> is what currently happens so it is coped with.
> 
> If the memory is marked as must be in VRAM (NIY), well then the user
> program has done something wrong and can kill the app (akin to
> segfault).

Yeah SIGBUS for "must be in VRAM" sounds like ok semantics.

> > > +
> > > +	if (!migrate.cpages) {
> > > +		err = -EFAULT;
> > > +		goto err_free;
> > > +	}
> > > +
> > > +	if (migrate.cpages != npages) {
> > > +		err = -EBUSY;
> > > +		goto err_finalize;
> > > +	}

What I think is more fundamental is that I think this one here doesn't
work. For migrate_to_ram you cannot assume that you can always migrate the
entire block, I think to uphold the core mm forward progress rules we need
to allow partial migrations there. And I think your current code allows
that.

But that then means you also are stuck with partial migration state here.
That was the point I tried to make.

> > > +/**
> > > + * __drm_gpusvm_migrate_to_sram - Migrate GPU SVM range to SRAM (internal)
> > > + * @gpusvm: Pointer to the GPU SVM structure
> > > + * @vas: Pointer to the VM area structure
> > > + * @page: Pointer to the page for fault handling (can be NULL)
> > > + * @start: Start address of the migration range
> > > + * @end: End address of the migration range
> > > + *
> > > + * This internal function performs the migration of the specified GPU SVM range
> > > + * to SRAM. It sets up the migration, populates + dma maps SRAM PFNs, and
> > > + * invokes the driver-specific operations for migration to SRAM.
> > > + *
> > > + * Returns:
> > > + * 0 on success, negative error code on failure.
> > > + */
> > > +static int __drm_gpusvm_migrate_to_sram(struct drm_gpusvm *gpusvm,
> > > +					struct vm_area_struct *vas,
> > > +					struct page *page,
> > > +					u64 start, u64 end)
> > > +{
> > > +	struct migrate_vma migrate = {
> > > +		.vma		= vas,
> > > +		.pgmap_owner	= gpusvm->device_private_page_owner,
> > > +		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
> > > +		.fault_page	= page,
> > > +	};
> > > +	unsigned long npages;
> > > +	struct page **pages;
> > > +	dma_addr_t *dma_addr;
> > > +	void *buf;
> > > +	int i, err = 0;
> > > +
> > > +	mmap_assert_locked(gpusvm->mm);
> > 
> > That's the wrong mm, at least for the ->migrate_to_ram path. You might be
> > called on a anon mapping from a child process. That also means that the
> > vma you're looking at might have no relationship with anythign you're
> > tracking in your gpusvm.
> >
> 
> Hmm, as discussed [3] I haven't added tests with child processes yet.
> Let me do that and update the design as needed. This likely isn't
> correct as you say.
> 
> [3] https://patchwork.freedesktop.org/patch/610957/?series=137870&rev=1#comment_1111169 

Ack. More tests should definitely help here to figure out what's up, and
what's just me being confused.

> > > +/**
> > > + * drm_gpusvm_migrate_to_ram - Migrate GPU SVM range to RAM (page fault handler)
> > > + * @vmf: Pointer to the fault information structure
> > > + *
> > > + * This function is a page fault handler used to migrate a GPU SVM range to RAM.
> > > + * It retrieves the GPU SVM range information from the faulting page and invokes
> > > + * the internal migration function to migrate the range back to RAM.
> > > + *
> > > + * Returns:
> > > + * VM_FAULT_SIGBUS on failure, 0 on success.
> > > + */
> > > +static vm_fault_t drm_gpusvm_migrate_to_ram(struct vm_fault *vmf)
> > > +{
> > > +	struct drm_gpusvm_zdd *zdd = vmf->page->zone_device_data;
> > > +	int err;
> > > +
> > > +	err = __drm_gpusvm_migrate_to_sram(zdd->range->gpusvm,
> > 
> > So I think zdd->range doesn't work, because even within a single mm the
> > vma mapping a given piece of anon memory does not need to be unique, you
> > can duplicate them with mremap.
> > 
> 
> This is attached to a page, not a VMA. Both AMD and Nvidia drivers use a
> similar lookup mechanism.

Yeah the page->zone_device_data is fine. It's the zone_device_rage->range
which I think isn't ok.

> > So all you have here is the physical memory and the vma, which might or
> > might not be from the same process as gpusvm->mm.
> > 
> > Also the child process scenario means you using mmap_write on the fault
> > side doesn't stop all cpu faults migrating stuff back.
> > 
> > Somewhat aside, but I think that means amdkfd's svm_range->migration_mutex
> > is busted, because it's va based and so misses concurrently ongoing
> > different mappings moving physical storage around underneath.
> >
> 
> I think all of the above which falls into the fork() + child process
> issues which you have raise. Until I test this out I can't speak to this
> any level of confidence so I won't. Thanks for raising this issue and
> let me write test cases as discussed and educate myself. Once I do that,
> we can engage in further discussions.

I think fork + childs will still result in zdd->range being unique (albeit
confused about which mm). You need mremap of some of these mappings to
change the addresses and really cause confusion, which I /think/ (but
didn't test) is doable with a single process even and duplicating anon
memory mappings with mremap.

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

next prev parent reply	other threads:[~2024-09-02 11:53 UTC|newest]

Thread overview: 100+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-28  2:48 [RFC PATCH 00/28] Introduce GPU SVM and Xe SVM implementation Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 01/28] dma-buf: Split out dma fence array create into alloc and arm functions Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 02/28] drm/xe: Invalidate media_gt TLBs in PT code Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 03/28] drm/xe: Retry BO allocation Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 04/28] mm/migrate: Add migrate_device_vma_range Matthew Brost
2024-08-29  9:03   ` Daniel Vetter
2024-08-29 15:58     ` Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 05/28] drm/gpusvm: Add support for GPU Shared Virtual Memory Matthew Brost
2024-08-28 14:31   ` Daniel Vetter
2024-08-28 14:46     ` Christian König
2024-08-28 15:43       ` Matthew Brost
2024-08-28 16:06         ` Alex Deucher
2024-08-28 16:25         ` Daniel Vetter
2024-08-29 16:40           ` Matthew Brost
2024-09-02 11:29             ` Daniel Vetter
2024-08-30  5:00     ` Matthew Brost
2024-09-02 11:36       ` Daniel Vetter
2024-08-28 18:50   ` Daniel Vetter
2024-08-29 16:49     ` Matthew Brost
2024-09-02 11:40       ` Daniel Vetter
2024-08-29  9:16   ` Thomas Hellström
2024-08-29 17:45     ` Matthew Brost
2024-08-29 18:13       ` Matthew Brost
2024-08-29 19:18       ` Thomas Hellström
2024-08-29 20:56         ` Matthew Brost
2024-08-30  8:18           ` Thomas Hellström
2024-08-30 13:58             ` Matthew Brost
2024-09-02  9:57               ` Thomas Hellström
2024-08-30  9:57           ` Thomas Hellström
2024-08-30 13:47             ` Matthew Brost
2024-09-02  9:45               ` Thomas Hellström
2024-09-02 12:33           ` Daniel Vetter
2024-09-04 12:27             ` Thomas Hellström
2024-09-24  8:41               ` Simona Vetter
2024-08-30  1:35     ` Matthew Brost
2024-08-29  9:45   ` Daniel Vetter
2024-08-29 17:27     ` Matthew Brost
2024-09-02 11:53       ` Daniel Vetter [this message]
2024-09-02 17:03         ` Matthew Brost
2024-09-11 16:06           ` Matthew Brost
2024-08-30  9:16   ` Thomas Hellström
2024-09-02 12:20     ` Daniel Vetter
2024-09-06 18:41   ` Zeng, Oak
2024-09-24  9:25     ` Simona Vetter
2024-09-25 16:34       ` Zeng, Oak
2024-09-24 10:42   ` Thomas Hellström
2024-09-24 16:30     ` Matthew Brost
2024-09-25 21:12       ` Matthew Brost
2024-10-09 10:50   ` Thomas Hellström
2024-10-16  3:18     ` Matthew Brost
2024-10-16  6:27       ` Thomas Hellström
2024-10-16  8:24         ` Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 06/28] drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_SYSTEM_ALLOCATON flag Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 07/28] drm/xe: Add SVM init / fini to faulting VMs Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 08/28] drm/xe: Add dma_addr res cursor Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 09/28] drm/xe: Add SVM range invalidation Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 10/28] drm/gpuvm: Add DRM_GPUVA_OP_USER Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 11/28] drm/xe: Add (re)bind to SVM page fault handler Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 12/28] drm/xe: Add SVM garbage collector Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 13/28] drm/xe: Add unbind to " Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 14/28] drm/xe: Do not allow system allocator VMA unbind if the GPU has bindings Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 15/28] drm/xe: Enable system allocator uAPI Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 16/28] drm/xe: Add migrate layer functions for SVM support Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 17/28] drm/xe: Add SVM device memory mirroring Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 18/28] drm/xe: Add GPUSVM copy SRAM / VRAM vfunc functions Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 19/28] drm/xe: Update PT layer to understand ranges in VRAM Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 20/28] drm/xe: Add Xe SVM populate_vram_pfn vfunc Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 21/28] drm/xe: Add Xe SVM vram_release vfunc Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 22/28] drm/xe: Add BO flags required for SVM Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 23/28] drm/xe: Add SVM VRAM migration Matthew Brost
2024-08-28 16:06   ` Daniel Vetter
2024-08-28 18:22     ` Daniel Vetter
2024-08-29  9:24     ` Christian König
2024-08-29  9:53       ` Thomas Hellström
2024-08-29 11:02         ` Daniel Vetter
2024-08-29 22:12           ` Matthew Brost
2024-08-29 22:23             ` Matthew Brost
2024-09-02 11:01             ` Christian König
2024-09-02 12:50               ` Daniel Vetter
2024-09-02 12:48             ` Daniel Vetter
2024-09-02 22:20               ` Matthew Brost
2024-09-03  8:07                 ` Simona Vetter
2024-08-29 14:30         ` Christian König
2024-08-29 21:53           ` Matthew Brost
2024-08-29 21:48       ` Matthew Brost
2024-09-02 13:02         ` Daniel Vetter
2024-08-28  2:48 ` [RFC PATCH 24/28] drm/xe: Basic SVM BO eviction Matthew Brost
2024-08-29 10:14   ` Daniel Vetter
2024-08-29 15:55     ` Matthew Brost
2024-09-02 13:05       ` Daniel Vetter
2024-08-28  2:48 ` [RFC PATCH 25/28] drm/xe: Add SVM debug Matthew Brost
2024-08-28  2:48 ` [RFC PATCH 26/28] drm/xe: Add modparam for SVM notifier size Matthew Brost
2024-08-28  2:49 ` [RFC PATCH 27/28] drm/xe: Add modparam for SVM prefault Matthew Brost
2024-08-28  2:49 ` [RFC PATCH 28/28] drm/gpusvm: Ensure all pages migrated upon eviction Matthew Brost
2024-08-28  2:55 ` ✓ CI.Patch_applied: success for Introduce GPU SVM and Xe SVM implementation Patchwork
2024-08-28  2:55 ` ✗ CI.checkpatch: warning " Patchwork
2024-08-28  2:56 ` ✗ CI.KUnit: failure " Patchwork
2024-09-24  9:16 ` [RFC PATCH 00/28] " Simona Vetter
2024-09-24 19:36   ` Matthew Brost
2024-09-25 11:41     ` Simona Vetter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZtWnKvaV964EA6mq@phenom.ffwll.local \
    --to=daniel.vetter@ffwll.ch \
    --cc=airlied@gmail.com \
    --cc=christian.koenig@amd.com \
    --cc=daniel@ffwll.ch \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.auld@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=thomas.hellstrom@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox