Re: [PATCH] drm/xe: Thread prefetch of SVM ranges

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Matthew Brost <matthew.brost@intel.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
	<himal.prasad.ghimiray@intel.com>, <michal.mrozek@intel.com>
Subject: Re: [PATCH] drm/xe: Thread prefetch of SVM ranges
Date: Tue, 17 Jun 2025 07:43:22 -0700	[thread overview]
Message-ID: <aFF/CsuVpyM7HTFY@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <813b13e287f6443f4518d8055f64f6539ca34b0e.camel@linux.intel.com>

On Tue, Jun 17, 2025 at 02:43:27PM +0200, Thomas Hellström wrote:
> Hi, Matt
> 
> On Sun, 2025-06-15 at 23:47 -0700, Matthew Brost wrote:
> > The migrate_vma_* functions are very CPU-intensive; as a result,
> > prefetching SVM ranges is limited by CPU performance rather than
> > paging
> > copy engine bandwidth. To accelerate SVM range prefetching, the step
> > that calls migrate_vma_* is now threaded. This uses a dedicated
> > workqueue, as the page fault workqueue cannot be shared without
> > risking
> > deadlocks—due to the prefetch IOCTL holding the VM lock in write mode
> > while work items in the page fault workqueue also require the VM
> > lock.
> > 
> > The prefetch workqueue is currently allocated in GT, similar to the
> > page
> > fault workqueue. While this is likely not the ideal location for
> > either,
> > refactoring will be deferred to a later patch.
> > 
> > Running xe_exec_system_allocator --r prefetch-benchmark, which tests
> > 64MB prefetches, shows an increase from ~4.35 GB/s to 12.25 GB/s with
> > this patch on drm-tip. Enabling high SLPC further increases
> > throughput
> > to ~15.25 GB/s, and combining SLPC with ULLS raises it to ~16 GB/s.
> > Both
> > of these optimizations are upcoming.
> 
> I looked at this again. I still think there are some optimizations that
> could be done in addition to Francois series to lessen the impact of
> this, but nevertheless to quickly get the real workload running on the
> GPU again when used on a single-client system.
> 
> I raised a question with the maintainers whether we should keep
> optimizations like this that improves performance for one client at the
> cost of others behind a kernel konfig, and also whether to expose
> parameters like the width of the queue both for this purpose and for
> parallel faults as sysfs knobs.
> 

sysfs knobs sounds reasonable to me, and perhaps just default to 2
threads and live with less than peak bandwidth for prefetch now until
Francios series lands?

> Some comments inline:
> 
> > 
> > v2:
> >  - Use dedicated prefetch workqueue
> >  - Pick dedicated prefetch thread count based on profiling
> >  - Skip threaded prefetch for only 1 range or if prefetching to SRAM
> >  - Fully tested
> > 
> > Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_gt_pagefault.c |  31 ++++++-
> >  drivers/gpu/drm/xe/xe_gt_types.h     |   2 +
> >  drivers/gpu/drm/xe/xe_vm.c           | 128 +++++++++++++++++++++----
> > --
> >  3 files changed, 135 insertions(+), 26 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > index e2d975b2fddb..941cca3371f2 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > +++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
> > @@ -400,6 +400,8 @@ static void pagefault_fini(void *arg)
> >  
> >  	destroy_workqueue(gt->usm.acc_wq);
> >  	destroy_workqueue(gt->usm.pf_wq);
> > +	if (gt->usm.prefetch_wq)
> > +		destroy_workqueue(gt->usm.prefetch_wq);
> >  }
> >  
> >  static int xe_alloc_pf_queue(struct xe_gt *gt, struct pf_queue
> > *pf_queue)
> > @@ -438,10 +440,24 @@ static int xe_alloc_pf_queue(struct xe_gt *gt,
> > struct pf_queue *pf_queue)
> >  	return 0;
> >  }
> >  
> > +static int prefetch_thread_count(struct xe_device *xe)
> > +{
> > +	if (!IS_DGFX(xe))
> > +		return 0;
> > +
> > +	/*
> > +	 * Based on profiling large aligned 2M prefetches, this is
> > the optimial
> > +	 * number of threads on BMG (only platform currently
> > supported). This
> > +	 * should be tuned for each supported platform and can
> > change on per
> > +	 * platform basis as optimizations land (e.g., large device
> > pages).
> > +	 */
> > +	return 5;
> > +}
> > +
> >  int xe_gt_pagefault_init(struct xe_gt *gt)
> >  {
> >  	struct xe_device *xe = gt_to_xe(gt);
> > -	int i, ret = 0;
> > +	int i, count, ret = 0;
> >  
> >  	if (!xe->info.has_usm)
> >  		return 0;
> > @@ -462,10 +478,23 @@ int xe_gt_pagefault_init(struct xe_gt *gt)
> >  	if (!gt->usm.pf_wq)
> >  		return -ENOMEM;
> >  
> > +	count = prefetch_thread_count(xe);
> > +	if (count) {
> > +		gt->usm.prefetch_wq =
> > alloc_workqueue("xe_gt_prefetch_work_queue",
> > +						      WQ_UNBOUND |
> > WQ_HIGHPRI,
> > +						      count);
> 
> Can we avoid WQ_HIGHPRI here without losing performance?
> Also if count gets near the number of available high-performance cores,
> I suspect we might see less effect of parallelizing like this?
> 

Let me test that out today and give some numbers breakdown of bandwidth
per thread count / effect of WQ_HIGHPRI.

> 
> > +		if (!gt->usm.prefetch_wq) {
> > +			destroy_workqueue(gt->usm.pf_wq);
> > +			return -ENOMEM;
> > +		}
> > +	}
> > +
> >  	gt->usm.acc_wq =
> > alloc_workqueue("xe_gt_access_counter_work_queue",
> >  					 WQ_UNBOUND | WQ_HIGHPRI,
> >  					 NUM_ACC_QUEUE);
> >  	if (!gt->usm.acc_wq) {
> > +		if (gt->usm.prefetch_wq)
> > +			destroy_workqueue(gt->usm.prefetch_wq);
> >  		destroy_workqueue(gt->usm.pf_wq);
> >  		return -ENOMEM;
> >  	}
> > diff --git a/drivers/gpu/drm/xe/xe_gt_types.h
> > b/drivers/gpu/drm/xe/xe_gt_types.h
> > index 7def0959da35..d9ba4921b8ce 100644
> > --- a/drivers/gpu/drm/xe/xe_gt_types.h
> > +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> > @@ -239,6 +239,8 @@ struct xe_gt {
> >  		u16 reserved_bcs_instance;
> >  		/** @usm.pf_wq: page fault work queue, unbound, high
> > priority */
> >  		struct workqueue_struct *pf_wq;
> > +		/** @usm.prefetch_wq: prefetch work queue, unbound,
> > high priority */
> > +		struct workqueue_struct *prefetch_wq;
> >  		/** @usm.acc_wq: access counter work queue, unbound,
> > high priority */
> >  		struct workqueue_struct *acc_wq;
> >  		/**
> > diff --git a/drivers/gpu/drm/xe/xe_vm.c b/drivers/gpu/drm/xe/xe_vm.c
> > index 6ef8c4dab647..1ae8e03aead6 100644
> > --- a/drivers/gpu/drm/xe/xe_vm.c
> > +++ b/drivers/gpu/drm/xe/xe_vm.c
> > @@ -2885,52 +2885,130 @@ static int check_ufence(struct xe_vma *vma)
> >  	return 0;
> >  }
> >  
> > -static int prefetch_ranges(struct xe_vm *vm, struct xe_vma_op *op)
> > +struct prefetch_thread {
> > +	struct work_struct work;
> > +	struct drm_gpusvm_ctx *ctx;
> > +	struct xe_vma *vma;
> > +	struct xe_svm_range *svm_range;
> > +	struct xe_tile *tile;
> > +	u32 region;
> > +	int err;
> > +};
> > +
> > +static void prefetch_work_func(struct work_struct *w)
> >  {
> > -	bool devmem_possible = IS_DGFX(vm->xe) &&
> > IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR);
> > -	struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
> > +	struct prefetch_thread *thread =
> > +		container_of(w, struct prefetch_thread, work);
> > +	struct xe_vma *vma = thread->vma;
> > +	struct xe_vm *vm = xe_vma_vm(vma);
> > +	struct xe_svm_range *svm_range = thread->svm_range;
> > +	u32 region = thread->region;
> > +	struct xe_tile *tile = thread->tile;
> >  	int err = 0;
> >  
> > -	struct xe_svm_range *svm_range;
> > +	if (!region) {
> > +		xe_svm_range_migrate_to_smem(vm, svm_range);
> > +	} else if (xe_svm_range_needs_migrate_to_vram(svm_range,
> > vma, region)) {
> > +		err = xe_svm_alloc_vram(vm, tile, svm_range, thread-
> > >ctx);
> > +		if (err) {
> > +			drm_dbg(&vm->xe->drm,
> > +				"VRAM allocation failed, retry from
> > userspace, asid=%u, gpusvm=%p, errno=%pe\n",
> > +				vm->usm.asid, &vm->svm.gpusvm,
> > ERR_PTR(err));
> > +			thread->err = -ENODATA;
> > +			return;
> > +		}
> > +		xe_svm_range_debug(svm_range, "PREFETCH - RANGE
> > MIGRATED TO VRAM");
> > +	}
> > +
> > +	err = xe_svm_range_get_pages(vm, svm_range, thread->ctx);
> > +	if (err) {
> > +		drm_dbg(&vm->xe->drm, "Get pages failed, asid=%u,
> > gpusvm=%p, errno=%pe\n",
> > +			vm->usm.asid, &vm->svm.gpusvm,
> > ERR_PTR(err));
> > +		if (err == -EOPNOTSUPP || err == -EFAULT || err == -
> > EPERM)
> > +			err = -ENODATA;
> > +		thread->err = err;
> > +		return;
> > +	}
> > +
> > +	xe_svm_range_debug(svm_range, "PREFETCH - RANGE GET PAGES
> > DONE");
> > +}
> > +
> > +static int prefetch_ranges(struct xe_vm *vm, struct xe_vma_op *op)
> > +{
> > +	struct xe_vma *vma = gpuva_to_vma(op->base.prefetch.va);
> > +	u32 j, region = op->prefetch_range.region;
> >  	struct drm_gpusvm_ctx ctx = {};
> > -	struct xe_tile *tile;
> > +	struct prefetch_thread stack_thread;
> > +	struct xe_svm_range *svm_range;
> > +	struct xarray prefetches;
> > +	bool sram = region_to_mem_type[region] == XE_PL_TT;
> > +	struct xe_tile *tile = sram ? xe_device_get_root_tile(vm-
> > >xe) :
> > +		&vm->xe->tiles[region_to_mem_type[region] -
> > XE_PL_VRAM0];
> >  	unsigned long i;
> > -	u32 region;
> > +	bool devmem_possible = IS_DGFX(vm->xe) &&
> > +		IS_ENABLED(CONFIG_DRM_XE_DEVMEM_MIRROR);
> > +	bool skip_threads = op->prefetch_range.ranges_count == 1 ||
> > sram;
> > +	struct prefetch_thread *thread = skip_threads ?
> > &stack_thread : NULL;
> > +	int err = 0;
> >  
> >  	if (!xe_vma_is_cpu_addr_mirror(vma))
> >  		return 0;
> >  
> > -	region = op->prefetch_range.region;
> > +	if (!skip_threads)
> > +		xa_init_flags(&prefetches, XA_FLAGS_ALLOC);
> >  
> >  	ctx.read_only = xe_vma_read_only(vma);
> >  	ctx.devmem_possible = devmem_possible;
> >  	ctx.check_pages_threshold = devmem_possible ? SZ_64K : 0;
> >  
> > -	/* TODO: Threading the migration */
> >  	xa_for_each(&op->prefetch_range.range, i, svm_range) {
> > -		if (!region)
> > -			xe_svm_range_migrate_to_smem(vm, svm_range);
> > +		if (!skip_threads) {
> > +			thread = kmalloc(sizeof(*thread),
> > GFP_KERNEL);
> > +			if (!thread)
> > +				goto wait_threads;
> >  
> > -		if (xe_svm_range_needs_migrate_to_vram(svm_range,
> > vma, region)) {
> > -			tile = &vm->xe-
> > >tiles[region_to_mem_type[region] - XE_PL_VRAM0];
> > -			err = xe_svm_alloc_vram(vm, tile, svm_range,
> > &ctx);
> > +			err = xa_alloc(&prefetches, &j, thread,
> > xa_limit_32b,
> > +				       GFP_KERNEL);
> 
> No locking (like in xarray) required here since prefetches is a stack
> variable, and no reason to expect cache trashing so use a linked list
> or simple array instead of an xarray?
> 

I think a simple array would be a good choice. Let me refactor this.

> 
> >  			if (err) {
> > -				drm_dbg(&vm->xe->drm, "VRAM
> > allocation failed, retry from userspace, asid=%u, gpusvm=%p,
> > errno=%pe\n",
> > -					vm->usm.asid, &vm-
> > >svm.gpusvm, ERR_PTR(err));
> > -				return -ENODATA;
> > +				kfree(thread);
> > +				goto wait_threads;
> >  			}
> > -			xe_svm_range_debug(svm_range, "PREFETCH -
> > RANGE MIGRATED TO VRAM");
> >  		}
> >  
> > -		err = xe_svm_range_get_pages(vm, svm_range, &ctx);
> > -		if (err) {
> > -			drm_dbg(&vm->xe->drm, "Get pages failed,
> > asid=%u, gpusvm=%p, errno=%pe\n",
> > -				vm->usm.asid, &vm->svm.gpusvm,
> > ERR_PTR(err));
> > -			if (err == -EOPNOTSUPP || err == -EFAULT ||
> > err == -EPERM)
> > -				err = -ENODATA;
> > -			return err;
> > +		INIT_WORK(&thread->work, prefetch_work_func);
> > +		thread->ctx = &ctx;
> > +		thread->vma = vma;
> > +		thread->svm_range = svm_range;
> > +		thread->tile = tile;
> > +		thread->region = region;
> > +		thread->err = 0;
> > +
> > +		if (skip_threads) {
> > +			prefetch_work_func(&thread->work);
> > +			if (thread->err)
> > +				return thread->err;
> > +		} else {
> > +			/*
> > +			 * Prefetch uses a dedicated workqueue, as
> > the page
> > +			 * fault workqueue cannot be shared without
> > risking
> > +			 * deadlocks—due to holding the VM lock in
> > write mode
> > +			 * here while work items in the page fault
> > workqueue
> > +			 * also require the VM lock.
> > +			 */
> 
> Hmm. This is weird. In principle, a parallel fault handler could be
> processing the same range simultaneously, and blow things up but since
> we hold the vm lock on behalf of the threads this doesn't happen. But
> if we were to properly annotate, for example drm_gpusvm_get_pages()
> with drm_gpusvm_driver_lock_held(), then that would assert. I don't
> think "let's hold the vm lock on behalf of the threads" is acceptable,
> really, unless we can find other examples in the kernel or preferrably
> even in drm.
> 
> This means we need some form of finer-grained locking in gpusvm, like
> for example a per-range lock, to be able to relax the vm lock to read
> mode both in the fault handler and here?
> 

This is the ultimate goal—to allow per-VM parallel faults. I hacked
together finer-grained locking a while back, but held off on posting it
until madvise and multi-GPU support landed, to avoid making it harder
for those features to merge.

I can post that refactor now if you think this a prerequisite to this
series.

> 
> > +			queue_work(tile->primary_gt-
> > >usm.prefetch_wq,
> > +				   &thread->work);
> > +		}
> > +	}
> > +
> > +wait_threads:
> > +	if (!skip_threads) {
> > +		xa_for_each(&prefetches, i, thread) {
> > +			flush_work(&thread->work);
> 
> Similarly this adds an interruptible wait. Ideally if we hit a signal
> here we'd like to just be able to forget about the threads and let them
> finish while we return?
> 

Is flush_work interruptible? This is undocumented and but from a look at
the code I don't believe it is. I agree ideally we'd want this be
interruptible but unsure if this is possible with the current workqueue
code.

Matt

> Thanks, 
> Thomas
> 
> 
> 
> > +			if (thread->err && (!err || err == -
> > ENODATA))
> > +				err = thread->err;
> > +			kfree(thread);
> >  		}
> > -		xe_svm_range_debug(svm_range, "PREFETCH - RANGE GET
> > PAGES DONE");
> > +		xa_destroy(&prefetches);
> >  	}
> >  
> >  	return err;
>

next prev parent reply	other threads:[~2025-06-17 14:42 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-16  6:47 [PATCH] drm/xe: Thread prefetch of SVM ranges Matthew Brost
2025-06-16  8:28 ` Thomas Hellström
2025-06-16  8:58   ` Matthew Brost
2025-06-16  9:24     ` Thomas Hellström
2025-06-16 12:06       ` Mrozek, Michal
2025-06-17 14:30         ` Matthew Brost
2025-06-16 11:20     ` Thomas Hellström
2025-06-17 17:10       ` Matthew Brost
2025-06-16 11:51 ` ✓ CI.KUnit: success for drm/xe: Thread prefetch of SVM ranges (rev2) Patchwork
2025-06-16 12:32 ` ✓ Xe.CI.BAT: " Patchwork
2025-06-16 17:53 ` ✗ Xe.CI.Full: failure " Patchwork
2025-06-17 12:43 ` [PATCH] drm/xe: Thread prefetch of SVM ranges Thomas Hellström
2025-06-17 14:43   ` Matthew Brost [this message]
  -- strict thread matches above, loose matches on Subject: below --
2025-05-28 17:27 Matthew Brost

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aFF/CsuVpyM7HTFY@lstrano-desk.jf.intel.com \
    --to=matthew.brost@intel.com \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=michal.mrozek@intel.com \
    --cc=thomas.hellstrom@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.