From: Matthew Brost <matthew.brost@intel.com>
To: "Summers, Stuart" <stuart.summers@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
"Yadav, Arvind" <arvind.yadav@intel.com>,
"thomas.hellstrom@linux.intel.com"
<thomas.hellstrom@linux.intel.com>,
"Dugast, Francois" <francois.dugast@intel.com>
Subject: Re: [PATCH v3 00/25] CPU binds and ULLS on migration queue
Date: Tue, 10 Mar 2026 15:17:06 -0700 [thread overview]
Message-ID: <abCYYllIzH3I3N8h@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <545e42a856cbbd924619f4b364242ba1777af0ea.camel@intel.com>
On Thu, Mar 05, 2026 at 03:56:50PM -0700, Summers, Stuart wrote:
> One question I have reading through some of the ULLS patches, why do we
> need to do this in the kernel? This adds quite a bit of complexity here
> that IMO might be a better fit for userspace migration, particularly
> for the userspace that already handles ULLS like L0. What is the
> benefit of doing this in the kernel vs adding a new API to allow some
> kind of opt in for migration in the UMD?
>
I cover some of this in the cover letter, but I’ll type it out again.

Page faults are handled by the kernel, and handling a fault triggers a
migration. For something like SVM, where we migrate chunks of 4K, 64K,
or 2M, keeping the per-migration submission overhead low makes a large
difference in performance.
Our SVM stats show that GuC-based submission has roughly 30µs of
overhead per job. Now consider the individual copy times on BMG with the
fastest available PCIe: 15µs, 20µs, and 60µs for 4K, 64K, and 2M,
respectively. The context-switch overhead is a huge cost on the critical
path, and with future devices and faster PCIe links, the ratio of
context-switch cost to copy time only gets worse.
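To make the ratio concrete, here is a small sketch of effective
migration bandwidth per chunk size, using only the numbers quoted above
(the GB/s figures are my own illustrative arithmetic, not measured
data):

```python
# Effective migration bandwidth with and without the ~30 us GuC
# submission overhead, using the BMG copy times quoted above.
# Illustrative arithmetic only; not measured data.

GUC_OVERHEAD_US = 30

# chunk name -> (size in bytes, raw copy time in microseconds)
chunks = {
    "4K":  (4 << 10,  15),
    "64K": (64 << 10, 20),
    "2M":  (2 << 20,  60),
}

results = {}
for name, (size, copy_us) in chunks.items():
    # bytes/us == MB/s, so divide by 1e3 for GB/s
    raw_gbs = size / copy_us / 1e3                     # ~ULLS path
    guc_gbs = size / (copy_us + GUC_OVERHEAD_US) / 1e3  # GuC path
    loss = 1 - guc_gbs / raw_gbs   # fraction of bandwidth lost
    results[name] = (raw_gbs, guc_gbs, loss)
    print(f"{name}: {raw_gbs:5.2f} GB/s raw copy, "
          f"{guc_gbs:5.2f} GB/s with 30us overhead ({loss:.0%} lost)")
```

Even in the 2M case the 30µs submission overhead costs about a third of
the achievable bandwidth, and the smaller the chunk, the larger the
fraction lost.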
We also have compute benchmarks measuring pagefault bandwidth, and
enabling ULLS (via modparam) shows multi-GB/s bandwidth improvements.
François and I can provide exact numbers, but we already have a ton of
performance fixes in flight that collectively result in roughly 10×
speedups in compute benchmarks. If we can get SVM operating close to
PCIe line rate, this becomes a very useful feature for applications,
and we’re basically already hitting line rate on BMG once all the
performance improvements land.
I’d also argue that it really isn’t all that complex. It largely fits
into existing concepts and is layered quite nicely.

There are prefetch APIs for SVM, but those require applications to call
them explicitly rather than relying on the device to fault on malloc’d
memory. Those APIs also get faster when ULLS is enabled. Also, in case
it isn’t clear: KMD-driven migration is the only option for SVM, since
the CPU page tables must be modified as part of the migration (i.e.,
user space can’t just issue a copy command to migrate a malloc’d buffer
to the device).
Matt
> The CPU binding generally even outside of the ULLS piece makes sense to
> me, so I don't think this is blocking particularly. But it would be
> nice to have a little more detail on the above before we move forward
> here.
>
> Thanks,
> Stuart
>
> On Fri, 2026-02-27 at 17:34 -0800, Matthew Brost wrote:
> > We now have data demonstrating the need for CPU binds and ULLS on the
> > migration queue, based on results generated from [1].
> >
> > On BMG, measurements show that when the GPU is continuously
> > processing faults, copy jobs run approximately 30–40µs faster
> > (depending on the test case) with ULLS compared to traditional GuC
> > submission with SLPC enabled on the migration queue. Startup from a
> > cold GPU shows an even larger speedup. Given the critical nature of
> > fault performance, ULLS appears to be a worthwhile feature.
> >
> > In addition to driver telemetry, UMD compute benchmarks consistently
> > show over 1GB/s improvement in pagefault benchmarks with ULLS
> > enabled.
> >
> > ULLS will consume more power (not yet measured) due to a
> > continuously running batch on the paging engine. However, compute
> > UMDs already do this on engines exposed to users, so this seems like
> > a worthwhile tradeoff. To mitigate power concerns, ULLS will exit
> > after a period of time in which no faults have been processed.
> >
> > CPU binds are required for ULLS to function, as the migration queue
> > needs exclusive access to the paging hardware engine. Thus, CPU binds
> > are included here.
> >
> > Beyond being a requirement for ULLS, CPU binds should also reduce
> > VM-bind latency, provide clearer multi-tile and TLB-invalidation
> > layering, reduce pressure on GuC during fault storms as it is
> > bypassed, and decouple kernel binds from unrelated copy/clear
> > jobs—especially beneficial when faults are serviced in parallel. In
> > a parallel-faulting test case, average bind time was reduced by
> > approximately 15µs. In the worst case, 2MB copy time (~60–140µs) ×
> > (number of pagefault threads − 1) of latency would otherwise be
> > added to a single fault. Reducing this latency increases overall
> > throughput of the fault handler.
> >
> > This series can be merged in phases:
> >
> > Phase 1: CPU binds (patches 1–13)
> > Phase 2: CPU-bind components and multi-tile relayers (patches 14–17)
> > Phase 3: ULLS on the migration execution queue (patches 18–25)
> >
> > v2:
> > - Use delayed worker to exit ULLS mode in an effort to save on power
> > - Various other cleanups
> > v3:
> > - CPU bind component, multi-tile relayer
> > - Split CPU bind patches in many small patches
> >
> > Matt
> >
> > [1] https://patchwork.freedesktop.org/series/149811/
> >
> > Matthew Brost (25):
> > drm/xe: Drop struct xe_migrate_pt_update argument from
> > populate/clear vfuns
> > drm/xe: Add xe_migrate_update_pgtables_cpu_execute helper
> > drm/xe: Decouple exec queue idle check from LRC
> > drm/xe: Add job count to GuC exec queue snapshot
> > drm/xe: Update xe_bo_put_deferred arguments to include writeback
> > flag
> > drm/xe: Add XE_BO_FLAG_PUT_VM_ASYNC
> > drm/xe: Update scheduler job layer to support PT jobs
> > drm/xe: Add helpers to access PT ops
> > drm/xe: Add struct xe_pt_job_ops
> > drm/xe: Update GuC submission backend to run PT jobs
> > drm/xe: Store level in struct xe_vm_pgtable_update
> > drm/xe: Don't use migrate exec queue for page fault binds
> > drm/xe: Enable CPU binds for jobs
> > drm/xe: Remove unused arguments from xe_migrate_pt_update_ops
> > drm/xe: Make bind queues operate cross-tile
> > drm/xe: Add CPU bind layer
> > drm/xe: Add device flag to enable PT mirroring across tiles
> > drm/xe: Add xe_hw_engine_write_ring_tail
> > drm/xe: Add ULLS support to LRC
> > drm/xe: Add ULLS migration job support to migration layer
> > drm/xe: Add MI_SEMAPHORE_WAIT instruction defs
> > drm/xe: Add ULLS migration job support to ring ops
> > drm/xe: Add ULLS migration job support to GuC submission
> > drm/xe: Enter ULLS for migration jobs upon page fault or SVM
> > prefetch
> > drm/xe: Add modparam to enable / disable ULLS on migrate queue
> >
> > drivers/gpu/drm/xe/Makefile | 1 +
> > .../gpu/drm/xe/instructions/xe_mi_commands.h | 6 +
> > drivers/gpu/drm/xe/xe_bo.c | 8 +-
> > drivers/gpu/drm/xe/xe_bo.h | 11 +-
> > drivers/gpu/drm/xe/xe_bo_types.h | 2 -
> > drivers/gpu/drm/xe/xe_cpu_bind.c | 296 +++++++
> > drivers/gpu/drm/xe/xe_cpu_bind.h | 118 +++
> > drivers/gpu/drm/xe/xe_debugfs.c | 1 +
> > drivers/gpu/drm/xe/xe_defaults.h | 1 +
> > drivers/gpu/drm/xe/xe_device.c | 17 +-
> > drivers/gpu/drm/xe/xe_device_types.h | 11 +
> > drivers/gpu/drm/xe/xe_drm_client.c | 2 +-
> > drivers/gpu/drm/xe/xe_exec_queue.c | 163 ++--
> > drivers/gpu/drm/xe/xe_exec_queue.h | 18 +-
> > drivers/gpu/drm/xe/xe_exec_queue_types.h | 21 +-
> > drivers/gpu/drm/xe/xe_guc_submit.c | 82 +-
> > drivers/gpu/drm/xe/xe_guc_submit_types.h | 2 +
> > drivers/gpu/drm/xe/xe_hw_engine.c | 10 +
> > drivers/gpu/drm/xe/xe_hw_engine.h | 1 +
> > drivers/gpu/drm/xe/xe_lrc.c | 51 ++
> > drivers/gpu/drm/xe/xe_lrc.h | 3 +
> > drivers/gpu/drm/xe/xe_lrc_types.h | 4 +
> > drivers/gpu/drm/xe/xe_migrate.c | 585 +++++--------
> > drivers/gpu/drm/xe/xe_migrate.h | 93 +--
> > drivers/gpu/drm/xe/xe_module.c | 4 +
> > drivers/gpu/drm/xe/xe_module.h | 1 +
> > drivers/gpu/drm/xe/xe_pagefault.c | 3 +
> > drivers/gpu/drm/xe/xe_pci.c | 2 +
> > drivers/gpu/drm/xe/xe_pci_types.h | 1 +
> > drivers/gpu/drm/xe/xe_pt.c | 773 +++++++++++-------
> > drivers/gpu/drm/xe/xe_pt.h | 12 +-
> > drivers/gpu/drm/xe/xe_pt_types.h | 49 +-
> > drivers/gpu/drm/xe/xe_ring_ops.c | 31 +
> > drivers/gpu/drm/xe/xe_sched_job.c | 100 ++-
> > drivers/gpu/drm/xe/xe_sched_job_types.h | 36 +-
> > drivers/gpu/drm/xe/xe_sync.c | 20 +-
> > drivers/gpu/drm/xe/xe_tlb_inval_job.c | 28 +-
> > drivers/gpu/drm/xe/xe_tlb_inval_job.h | 4 +-
> > drivers/gpu/drm/xe/xe_vm.c | 241 +++---
> > drivers/gpu/drm/xe/xe_vm.h | 3 +
> > drivers/gpu/drm/xe/xe_vm_types.h | 22 +-
> > 41 files changed, 1658 insertions(+), 1179 deletions(-)
> > create mode 100644 drivers/gpu/drm/xe/xe_cpu_bind.c
> > create mode 100644 drivers/gpu/drm/xe/xe_cpu_bind.h
> >
>
Thread overview: 63+ messages
2026-02-28 1:34 [PATCH v3 00/25] CPU binds and ULLS on migration queue Matthew Brost
2026-02-28 1:34 ` [PATCH v3 01/25] drm/xe: Drop struct xe_migrate_pt_update argument from populate/clear vfuns Matthew Brost
2026-03-05 14:17 ` Francois Dugast
2026-02-28 1:34 ` [PATCH v3 02/25] drm/xe: Add xe_migrate_update_pgtables_cpu_execute helper Matthew Brost
2026-03-05 14:39 ` Francois Dugast
2026-02-28 1:34 ` [PATCH v3 03/25] drm/xe: Decouple exec queue idle check from LRC Matthew Brost
2026-03-02 20:50 ` Summers, Stuart
2026-03-02 21:02 ` Matthew Brost
2026-03-03 21:26 ` Summers, Stuart
2026-03-03 22:42 ` Matthew Brost
2026-03-03 22:54 ` Summers, Stuart
2026-02-28 1:34 ` [PATCH v3 04/25] drm/xe: Add job count to GuC exec queue snapshot Matthew Brost
2026-03-02 20:50 ` Summers, Stuart
2026-02-28 1:34 ` [PATCH v3 05/25] drm/xe: Update xe_bo_put_deferred arguments to include writeback flag Matthew Brost
2026-04-01 12:20 ` Francois Dugast
2026-04-01 22:39 ` Matthew Brost
2026-02-28 1:34 ` [PATCH v3 06/25] drm/xe: Add XE_BO_FLAG_PUT_VM_ASYNC Matthew Brost
2026-04-01 12:22 ` Francois Dugast
2026-04-01 22:38 ` Matthew Brost
2026-02-28 1:34 ` [PATCH v3 07/25] drm/xe: Update scheduler job layer to support PT jobs Matthew Brost
2026-03-03 22:50 ` Summers, Stuart
2026-03-03 23:00 ` Matthew Brost
2026-02-28 1:34 ` [PATCH v3 08/25] drm/xe: Add helpers to access PT ops Matthew Brost
2026-04-07 15:22 ` Francois Dugast
2026-02-28 1:34 ` [PATCH v3 09/25] drm/xe: Add struct xe_pt_job_ops Matthew Brost
2026-03-03 23:26 ` Summers, Stuart
2026-03-03 23:28 ` Matthew Brost
2026-02-28 1:34 ` [PATCH v3 10/25] drm/xe: Update GuC submission backend to run PT jobs Matthew Brost
2026-03-03 23:28 ` Summers, Stuart
2026-03-04 0:26 ` Matthew Brost
2026-03-04 20:43 ` Summers, Stuart
2026-03-04 21:53 ` Matthew Brost
2026-03-05 20:24 ` Summers, Stuart
2026-02-28 1:34 ` [PATCH v3 11/25] drm/xe: Store level in struct xe_vm_pgtable_update Matthew Brost
2026-03-03 23:44 ` Summers, Stuart
2026-02-28 1:34 ` [PATCH v3 12/25] drm/xe: Don't use migrate exec queue for page fault binds Matthew Brost
2026-02-28 1:34 ` [PATCH v3 13/25] drm/xe: Enable CPU binds for jobs Matthew Brost
2026-02-28 1:34 ` [PATCH v3 14/25] drm/xe: Remove unused arguments from xe_migrate_pt_update_ops Matthew Brost
2026-02-28 1:34 ` [PATCH v3 15/25] drm/xe: Make bind queues operate cross-tile Matthew Brost
2026-02-28 1:34 ` [PATCH v3 16/25] drm/xe: Add CPU bind layer Matthew Brost
2026-02-28 1:34 ` [PATCH v3 17/25] drm/xe: Add device flag to enable PT mirroring across tiles Matthew Brost
2026-02-28 1:34 ` [PATCH v3 18/25] drm/xe: Add xe_hw_engine_write_ring_tail Matthew Brost
2026-02-28 1:34 ` [PATCH v3 19/25] drm/xe: Add ULLS support to LRC Matthew Brost
2026-03-05 20:21 ` Francois Dugast
2026-02-28 1:34 ` [PATCH v3 20/25] drm/xe: Add ULLS migration job support to migration layer Matthew Brost
2026-03-05 23:34 ` Summers, Stuart
2026-03-09 23:11 ` Matthew Brost
2026-02-28 1:34 ` [PATCH v3 21/25] drm/xe: Add MI_SEMAPHORE_WAIT instruction defs Matthew Brost
2026-02-28 1:34 ` [PATCH v3 22/25] drm/xe: Add ULLS migration job support to ring ops Matthew Brost
2026-02-28 1:34 ` [PATCH v3 23/25] drm/xe: Add ULLS migration job support to GuC submission Matthew Brost
2026-02-28 1:35 ` [PATCH v3 24/25] drm/xe: Enter ULLS for migration jobs upon page fault or SVM prefetch Matthew Brost
2026-02-28 1:35 ` [PATCH v3 25/25] drm/xe: Add modparam to enable / disable ULLS on migrate queue Matthew Brost
2026-03-05 22:59 ` Summers, Stuart
2026-04-01 22:44 ` Matthew Brost
2026-02-28 1:43 ` ✗ CI.checkpatch: warning for CPU binds and ULLS on migration queue (rev3) Patchwork
2026-02-28 1:44 ` ✓ CI.KUnit: success " Patchwork
2026-02-28 2:32 ` ✓ Xe.CI.BAT: " Patchwork
2026-02-28 13:59 ` ✗ Xe.CI.FULL: failure " Patchwork
2026-03-02 17:54 ` Summers, Stuart
2026-03-02 18:13 ` Matthew Brost
2026-03-05 22:56 ` [PATCH v3 00/25] CPU binds and ULLS on migration queue Summers, Stuart
2026-03-10 22:17 ` Matthew Brost [this message]
2026-03-20 15:31 ` Thomas Hellström