From: Maciej Patelczyk <maciej.patelczyk@intel.com>
To: Matthew Brost <matthew.brost@intel.com>,
<intel-xe@lists.freedesktop.org>
Cc: <stuart.summers@intel.com>, <arvind.yadav@intel.com>,
<himal.prasad.ghimiray@intel.com>,
<thomas.hellstrom@linux.intel.com>, <francois.dugast@intel.com>
Subject: Re: [PATCH v4 08/12] drm/xe: Chain page faults via queue-resident cache to avoid fault storms
Date: Fri, 8 May 2026 14:03:49 +0200 [thread overview]
Message-ID: <08310532-c78a-4767-aea2-abc86877a156@intel.com> (raw)
In-Reply-To: <20260226042834.2963245-9-matthew.brost@intel.com>
On 26/02/2026 05:28, Matthew Brost wrote:
> Some Xe platforms can generate pagefault storms where many faults target
> the same address range in a short time window (e.g. many EU threads
> faulting the same page). The current worker/locking model effectively
> serializes faults for a given range and repeatedly performs VMA/range
> lookups for each fault, which creates head-of-queue blocking and wastes
> CPU in the hot path.
>
> Introduce a page fault chaining cache that coalesces faults targeting
> the same ASID and address range.
>
> Each worker tracks the active fault range it is servicing. Fault entries
> reside in stable queue storage, allowing the IRQ handler to match new
> faults against the worker cache and directly chain cache hits onto the
> active entry without allocation or waiting for dequeue. Once the leading
> fault completes, the worker acknowledges the entire chain.
>
> A small allocation state is added to each entry so queue, worker, and
> IRQ paths can safely reference the same fault object. This prevents
> reuse while the fault is active and guarantees that chained faults
> remain valid until acknowledged.
>
> Fault handlers also record the serviced range so subsequent faults can
> be acknowledged without re-running the full resolution path.
>
> This removes repeated fault resolution during fault storms and
> significantly improves forward progress in SVM workloads.
>
> Assisted-by: Chat-GPT # Documentation
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
> drivers/gpu/drm/xe/xe_pagefault.c | 434 +++++++++++++++++++++---
> drivers/gpu/drm/xe/xe_pagefault.h | 71 ++++
> drivers/gpu/drm/xe/xe_pagefault_types.h | 78 +++--
> drivers/gpu/drm/xe/xe_svm.c | 16 +-
> drivers/gpu/drm/xe/xe_svm.h | 9 +-
> 5 files changed, 523 insertions(+), 85 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> index 030452923ab9..9c14f9505faf 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -35,6 +35,70 @@
> * xe_pagefault.c implements the consumer layer.
> */
>
> +/**
> + * DOC: Xe page fault cache
> + *
> + * Some Xe hardware can trigger “fault storms,” which are many page faults to
> + * the same address within a short period of time. An example is many EU threads
> + * faulting on the same page simultaneously. With the current page fault locking
> + * structure, only one page fault for a given address range can be processed at
> + * a time. This causes head-of-queue blocking across workers, killing
> + * parallelism. If the page fault handler must repeatedly look up resources
> + * (VMAs, ranges) to determine that the pages are valid for each fault in the
> + * storm, the time complexity grows rapidly.
> + *
> + * To address this, each page fault worker maintains a cache of the active fault
> + * being processed. Subsequent faults that hit in the cache are chained to the
> + * pending fault, and all chained faults are acknowledged once the initial fault
> + * completes. This alleviates head-of-queue blocking and quickly chains faults
> + * in the upper layers, avoiding expensive lookups in the main fault-handling
> + * path.
> + *
> + * Faults are buffered in the page fault queue in a way that provides stable
> + * storage for outstanding faults. In particular, faults may be chained directly
> + * while still resident in the queue storage (i.e., outside the worker’s current
> + * head/tail dequeue position). This allows the IRQ handler to match newly
> + * arrived faults against the per-worker cache and immediately chain cache hits
> + * onto the active fault under the queue lock, without allocating memory or
> + * waiting for the worker to pop the fault first.
> + *
> + * A per-fault state field is used to assert correctness of these invariants.
> + * The state tracks whether an entry is free, queued, chained, or currently
> + * active. Transitions are performed under the page fault queue lock, and the
> + * worker acknowledges faults by walking the chain and returning entries to the
> + * free state once they are complete.
> + */
> +
> +/**
> + * enum xe_pagefault_alloc_state - lifetime state for a page fault queue entry
> + * @XE_PAGEFAULT_ALLOC_STATE_FREE:
> + * Entry is unused and may be overwritten by the producer, a consumer
> + * retry, or a requeue.
> + * @XE_PAGEFAULT_ALLOC_STATE_QUEUED:
> + * Entry has been enqueued and may be dequeued by a worker.
> + * @XE_PAGEFAULT_ALLOC_STATE_ACTIVE:
> + * Entry has been dequeued and is the worker's currently serviced fault.
> + * The worker may attach additional faults to it via consumer.next.
> + * @XE_PAGEFAULT_ALLOC_STATE_CHAINED:
> + * Entry is not independently serviced; it has been chained onto an
> + * ACTIVE entry via consumer.next and will be acknowledged when the
> + * leading fault completes.
> + *
> + * The page fault queue provides stable storage for outstanding faults so the
> + * IRQ handler can chain new cache hits directly onto a worker's active fault.
> + * Because entries may remain referenced outside the consumer dequeue window,
> + * the producer must only write into entries in the FREE state.
> + *
> + * State transitions are protected by the page fault queue lock. Workers return
> + * entries to FREE after acknowledging the fault (either as ACTIVE or CHAINED).
> + */
> +enum xe_pagefault_alloc_state {
> + XE_PAGEFAULT_ALLOC_STATE_FREE = 0,
> + XE_PAGEFAULT_ALLOC_STATE_QUEUED = 1,
> + XE_PAGEFAULT_ALLOC_STATE_CHAINED = 2,
> + XE_PAGEFAULT_ALLOC_STATE_ACTIVE = 3,
> +};
> +
> static int xe_pagefault_entry_size(void)
> {
> /*
> @@ -64,7 +128,7 @@ static int xe_pagefault_begin(struct drm_exec *exec, struct xe_vma *vma,
> }
>
> static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
> - bool atomic)
> + struct xe_pagefault *pf, bool atomic)
> {
> struct xe_vm *vm = xe_vma_vm(vma);
> struct xe_tile *tile = gt_to_tile(gt);
> @@ -89,8 +153,11 @@ static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
>
> /* Check if VMA is valid, opportunistic check only */
> if (xe_vm_has_valid_gpu_mapping(tile, vma->tile_present,
> - vma->tile_invalidated) && !atomic)
> + vma->tile_invalidated) && !atomic) {
> + xe_pagefault_set_start_addr(pf, xe_vma_start(vma));
> + xe_pagefault_set_end_addr(pf, xe_vma_end(vma));
> return 0;
> + }
>
> do {
> if (xe_vma_is_userptr(vma) &&
> @@ -128,6 +195,10 @@ static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
> } while (err == -EAGAIN);
>
> if (!err) {
> + /* Give hint to immediately ack faults */
> + xe_pagefault_set_start_addr(pf, xe_vma_start(vma));
> + xe_pagefault_set_end_addr(pf, xe_vma_end(vma));
> +
> dma_fence_wait(fence, false);
> dma_fence_put(fence);
> }
> @@ -199,10 +270,10 @@ static int xe_pagefault_service(struct xe_pagefault *pf)
> atomic = xe_pagefault_access_is_atomic(pf->consumer.access_type);
>
> if (xe_vma_is_cpu_addr_mirror(vma))
> - err = xe_svm_handle_pagefault(vm, vma, gt,
> + err = xe_svm_handle_pagefault(vm, vma, pf, gt,
> pf->consumer.page_addr, atomic);
> else
> - err = xe_pagefault_handle_vma(gt, vma, atomic);
> + err = xe_pagefault_handle_vma(gt, vma, pf, atomic);
>
> unlock_vm:
> if (!err)
> @@ -214,33 +285,228 @@ static int xe_pagefault_service(struct xe_pagefault *pf)
> return err;
> }
>
> +#define XE_PAGEFAULT_CACHE_START_INVALID U64_MAX
> +#define xe_pagefault_cache_start_invalidate(val) \
> + (val = XE_PAGEFAULT_CACHE_START_INVALID)
> +
> +static void
> +__xe_pagefault_cache_invalidate(struct xe_pagefault_queue *pf_queue,
> + struct xe_pagefault_work *pf_work)
> +{
> + lockdep_assert_held(&pf_queue->lock);
> +
> + xe_pagefault_cache_start_invalidate(pf_work->cache.start);
> +}
> +
> +static void
> +xe_pagefault_cache_invalidate(struct xe_pagefault_queue *pf_queue,
> + struct xe_pagefault_work *pf_work)
> +{
> + spin_lock_irq(&pf_queue->lock);
> + __xe_pagefault_cache_invalidate(pf_queue, pf_work);
> + spin_unlock_irq(&pf_queue->lock);
> +}
> +
> +static bool xe_pagefault_queue_full(struct xe_pagefault_queue *pf_queue)
> +{
> + lockdep_assert_held(&pf_queue->lock);
> +
> + return CIRC_SPACE(pf_queue->head, pf_queue->tail,
> + pf_queue->size) <= xe_pagefault_entry_size();
> +}
> +
> +static struct xe_pagefault *
> +__xe_pagefault_queue_add(struct xe_pagefault_queue *pf_queue,
> + struct xe_pagefault *pf)
> +{
> + struct xe_device *xe = container_of(pf_queue, typeof(*xe),
> + usm.pf_queue);
> + struct xe_pagefault *lpf;
> +
> + lockdep_assert_held(&pf_queue->lock);
> +
> + do {
> + xe_assert(xe, !xe_pagefault_queue_full(pf_queue));
> +
> + lpf = (pf_queue->data + pf_queue->head);
> + pf_queue->head = (pf_queue->head + xe_pagefault_entry_size()) %
> + pf_queue->size;
> + } while (lpf->consumer.alloc_state != XE_PAGEFAULT_ALLOC_STATE_FREE);
> +
> + memcpy(lpf, pf, sizeof(*pf));
> + lpf->consumer.alloc_state = XE_PAGEFAULT_ALLOC_STATE_QUEUED;
> +
> + return lpf;
> +}
> +
> static void xe_pagefault_queue_retry(struct xe_pagefault_queue *pf_queue,
> - struct xe_pagefault *pf)
> + struct xe_pagefault *pf,
> + struct xe_pagefault_work *pf_work)
> {
> + struct xe_device *xe = container_of(pf_queue, typeof(*xe),
> + usm.pf_queue);
> +
> + xe_assert(xe, pf->consumer.alloc_state ==
> + XE_PAGEFAULT_ALLOC_STATE_ACTIVE);
> +
> spin_lock_irq(&pf_queue->lock);
> - if (!pf_queue->tail)
> - pf_queue->tail = pf_queue->size - xe_pagefault_entry_size();
> - else
> - pf_queue->tail -= xe_pagefault_entry_size();
> - memcpy(pf_queue->data + pf_queue->tail, pf, sizeof(*pf));
> + pf->consumer.alloc_state = XE_PAGEFAULT_ALLOC_STATE_FREE;
> + __xe_pagefault_queue_add(pf_queue, pf);
> + __xe_pagefault_cache_invalidate(pf_queue, pf_work);
> spin_unlock_irq(&pf_queue->lock);
> }
>
> -static bool xe_pagefault_queue_pop(struct xe_pagefault_queue *pf_queue,
> - struct xe_pagefault *pf)
> +static struct xe_pagefault *
> +xe_pagefault_queue_requeue(struct xe_pagefault_queue *pf_queue,
> + struct xe_pagefault *pf, struct xe_gt *gt)
This is specifically for chained entries, and what it actually does is
_unchain_and_requeue.
> {
> - bool found_fault = false;
> + struct xe_device *xe = container_of(pf_queue, typeof(*xe),
> + usm.pf_queue);
> + struct xe_pagefault *next = pf->consumer.next, *lpf;
> +
> + xe_assert(xe, pf->consumer.alloc_state ==
> + XE_PAGEFAULT_ALLOC_STATE_CHAINED);
>
> spin_lock_irq(&pf_queue->lock);
> - if (pf_queue->tail != pf_queue->head) {
> - memcpy(pf, pf_queue->data + pf_queue->tail, sizeof(*pf));
> - pf_queue->tail = (pf_queue->tail + xe_pagefault_entry_size()) %
> - pf_queue->size;
> - found_fault = true;
> - }
> + pf->consumer.alloc_state = XE_PAGEFAULT_ALLOC_STATE_FREE;
> + lpf = __xe_pagefault_queue_add(pf_queue, pf);
> + lpf->consumer.next = NULL;
> + lpf->consumer.fault_type_level |= XE_PAGEFAULT_REQUEUE_MASK;
> spin_unlock_irq(&pf_queue->lock);
>
> - return found_fault;
> + return next;
> +}
> +
> +static bool xe_pagefault_cache_match(struct xe_pagefault *pf, u64 start,
> + u64 end, u64 cache_asid)
Maybe drop 'cache' from the name, since the comparison is only against the
function's arguments.
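Just a sketch of the rename I have in mind, essentially the same body as in
the patch:

static bool xe_pagefault_match(struct xe_pagefault *pf, u64 start,
			       u64 end, u64 asid)
{
	struct xe_device *xe = gt_to_xe(pf->gt);
	u64 page_addr = pf->consumer.page_addr;

	xe_assert(xe, pf->consumer.alloc_state !=
		  XE_PAGEFAULT_ALLOC_STATE_FREE);

	/* Compare against the caller-supplied range/ASID, not a cache */
	return page_addr >= start && page_addr < end &&
	       pf->consumer.asid == asid;
}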
> +{
> + struct xe_device *xe = gt_to_xe(pf->gt);
> + u64 page_addr = pf->consumer.page_addr;
> + u32 pf_asid = pf->consumer.asid;
> +
> + xe_assert(xe, pf->consumer.alloc_state !=
> + XE_PAGEFAULT_ALLOC_STATE_FREE);
> +
> + return page_addr >= start && page_addr < end &&
> + pf_asid == cache_asid;
> +}
> +
> +static bool xe_pagefault_cache_hit(struct xe_pagefault_queue *pf_queue,
> + struct xe_pagefault *pf)
I would say this function chains the entry onto a worker's active chain;
I think you could rename it to xe_pagefault_entry_chained() or similar.
From 'cache_hit' I would expect only an answer to whether the fault matches
any cache.
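One option, just a sketch: keep the match as a pure predicate and make the
chaining step an explicit helper (names are only suggestions):

static void xe_pagefault_chain(struct xe_pagefault_work *pf_work,
			       struct xe_pagefault *pf)
{
	/* Attach pf behind the worker's currently active fault */
	pf->consumer.alloc_state = XE_PAGEFAULT_ALLOC_STATE_CHAINED;
	pf->consumer.next = pf_work->cache.pf->consumer.next;
	pf_work->cache.pf->consumer.next = pf;
}

That way the call sites make it obvious that a 'hit' has the side effect of
chaining.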
> +{
> + struct xe_device *xe = container_of(pf_queue, typeof(*xe),
> + usm.pf_queue);
> + struct xe_pagefault_work *pf_work;
> + bool requeue = FIELD_GET(XE_PAGEFAULT_REQUEUE_MASK,
> + pf->consumer.fault_type_level);
> + int i;
> +
> + lockdep_assert_held(&pf_queue->lock);
> + xe_assert(xe, pf->consumer.alloc_state ==
> + XE_PAGEFAULT_ALLOC_STATE_QUEUED);
> +
> + /*
> + * If this is a retry, we may already have a chain attached. In that
> + * case, we cannot hit in the cache because chains cannot easily be
> + * combined.
> + */
> + if (pf->consumer.next)
> + return false;
> +
> + for (i = 0, pf_work = xe->usm.pf_workers;
> + i < xe->info.num_pf_work; ++i, ++pf_work) {
> + u64 start = pf_work->cache.start;
> + u64 end = requeue ? start + SZ_4K : pf_work->cache.end;
> + u32 asid = pf_work->cache.asid;
> +
> + if (xe_pagefault_cache_match(pf, start, end, asid)) {
> + xe_assert(xe, pf_work->cache.pf->consumer.alloc_state ==
> + XE_PAGEFAULT_ALLOC_STATE_ACTIVE);
> +
> + pf->consumer.alloc_state =
> + XE_PAGEFAULT_ALLOC_STATE_CHAINED;
> + pf->consumer.next = pf_work->cache.pf->consumer.next;
> + pf_work->cache.pf->consumer.next = pf;
> +
> + return true;
> + }
> + }
> +
> + return false;
> +}
> +
> +static void xe_pagefault_queue_advance(struct xe_pagefault_queue *pf_queue)
> +{
> + lockdep_assert_held(&pf_queue->lock);
> +
> + pf_queue->tail = (pf_queue->tail + xe_pagefault_entry_size()) %
> + pf_queue->size;
> +}
> +
> +static bool xe_pagefault_queue_pop(struct xe_pagefault_queue *pf_queue,
> + struct xe_pagefault **pf, int id)
> +{
> + struct xe_device *xe = container_of(pf_queue, typeof(*xe),
> + usm.pf_queue);
> + struct xe_pagefault_work *pf_work;
> + struct xe_pagefault *lpf;
> + size_t align = SZ_2M;
> +
> + guard(spinlock_irq)(&pf_queue->lock);
> +
> + for (*pf = NULL; !*pf;) {
> + if (pf_queue->tail == pf_queue->head)
> + return false;
xe_pagefault_queue_empty(pf_queue)?
> +
> + lpf = (pf_queue->data + pf_queue->tail);
Maybe add a helper:

static inline struct xe_pagefault *
xe_pagefault_queue_get_tail_locked(struct xe_pagefault_queue *pf_queue)
{
	return pf_queue->data + pf_queue->tail;
}

Getting the tail is used more than once in the code.
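With that helper (and xe_pagefault_queue_empty() from later in the patch,
assuming it is declared before the pop path), the drain loop below would read
roughly:

	while (!xe_pagefault_queue_empty(pf_queue)) {
		lpf = xe_pagefault_queue_get_tail_locked(pf_queue);

		if (lpf->consumer.alloc_state !=
		    XE_PAGEFAULT_ALLOC_STATE_QUEUED) {
			xe_pagefault_queue_advance(pf_queue);
			continue;
		}

		if (!xe_pagefault_cache_hit(pf_queue, lpf))
			break;

		xe_pagefault_queue_advance(pf_queue);
	}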
> + xe_pagefault_queue_advance(pf_queue);
> +
> + if (lpf->consumer.alloc_state !=
> + XE_PAGEFAULT_ALLOC_STATE_QUEUED)
> + continue;
> +
> + if (xe_pagefault_cache_hit(pf_queue, lpf))
> + continue;
This chains a STATE_QUEUED entry onto the matching worker's active chain.
> +
> + *pf = lpf; /* Hand back page fault for processing */
> + }
> +
> + /*
> + * No cache hit; allocate a new cache entry. We assume most faults
> + * within a 2M range will hit the same pages. If this assumption proves
> + * false, the mismatched fault is requeued after the initial fault is
> + * acknowledged.
> + */
> + pf_work = xe->usm.pf_workers + id;
> + if (FIELD_GET(XE_PAGEFAULT_REQUEUE_MASK,
> + lpf->consumer.fault_type_level))
> + align = SZ_4K;
> + pf_work->cache.start = ALIGN_DOWN(lpf->consumer.page_addr, align);
> + pf_work->cache.end = pf_work->cache.start + align;
> + pf_work->cache.asid = lpf->consumer.asid;
> + pf_work->cache.pf = lpf;
> + lpf->consumer.alloc_state = XE_PAGEFAULT_ALLOC_STATE_ACTIVE;
> +
> + /* Drain queue until empty or new fault found */
> + while (1) {
> + if (pf_queue->tail == pf_queue->head)
> + break;
> +
> + lpf = (pf_queue->data + pf_queue->tail);
> +
> + if (lpf->consumer.alloc_state !=
> + XE_PAGEFAULT_ALLOC_STATE_QUEUED) {
> + xe_pagefault_queue_advance(pf_queue);
> + continue;
> + }
> +
> + if (!xe_pagefault_cache_hit(pf_queue, lpf))
> + break;
> +
> + xe_pagefault_queue_advance(pf_queue);
> + }
> +
The initial chaining begins in this while(1). If there are pagefaults queued
before any worker starts, they will be chained for the first time in this
loop.
> + return true;
> }
>
> static void xe_pagefault_print(struct xe_pagefault *pf)
> @@ -276,40 +542,91 @@ static void xe_pagefault_queue_work(struct work_struct *w)
> container_of(w, typeof(*pf_work), work);
> struct xe_device *xe = pf_work->xe;
> struct xe_pagefault_queue *pf_queue = &xe->usm.pf_queue;
> - struct xe_pagefault pf;
> + struct xe_pagefault *pf;
> ktime_t start = xe_gt_stats_ktime_get();
> - struct xe_gt *gt = NULL;
> unsigned long threshold;
> + u64 cache_start = XE_PAGEFAULT_CACHE_START_INVALID, cache_end = 0;
> + u32 cache_asid = 0;
>
> #define USM_QUEUE_MAX_RUNTIME_MS 20
> threshold = jiffies + msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
>
> - while (xe_pagefault_queue_pop(pf_queue, &pf)) {
> - int err;
>
> - if (!pf.gt) /* Fault squashed during reset */
> - continue;
> + while (xe_pagefault_queue_pop(pf_queue, &pf, pf_work->id)) {
> + struct xe_gt *gt = pf->gt;
> + u32 asid = pf->consumer.asid;
> + int err = 0;
> +
> + /* Last fault same address, ack immediately */
> + if (xe_pagefault_cache_match(pf, cache_start, cache_end,
> + cache_asid))
> + goto ack_fault;
>
> - gt = pf.gt;
> - err = xe_pagefault_service(&pf);
> + err = xe_pagefault_service(pf);
>
> if (err == -EAGAIN) {
> - xe_pagefault_queue_retry(pf_queue, &pf);
> + xe_pagefault_queue_retry(pf_queue, pf, pf_work);
> queue_work(xe->usm.pf_wq, w);
> break;
> - } else if (err) {
> - if (!(pf.consumer.access_type & XE_PAGEFAULT_ACCESS_PREFETCH)) {
> - xe_pagefault_print(&pf);
> - xe_gt_info(pf.gt, "Fault response: Unsuccessful %pe\n",
> + } else if (err ) {
Checkpatch will catch it anyway. Space after err.
> + if (!(pf->consumer.access_type & XE_PAGEFAULT_ACCESS_PREFETCH)) {
> + xe_pagefault_cache_start_invalidate(cache_start);
> + xe_pagefault_print(pf);
> + xe_gt_info(pf->gt, "Fault response: Unsuccessful %pe\n",
> ERR_PTR(err));
> } else {
> - xe_gt_stats_incr(pf.gt, XE_GT_STATS_ID_INVALID_PREFETCH_PAGEFAULT_COUNT, 1);
> - xe_gt_dbg(pf.gt, "Prefetch Fault response: Unsuccessful %pe\n",
> + xe_gt_stats_incr(pf->gt, XE_GT_STATS_ID_INVALID_PREFETCH_PAGEFAULT_COUNT, 1);
> + xe_gt_dbg(pf->gt, "Prefetch Fault response: Unsuccessful %pe\n",
> ERR_PTR(err));
> }
> + } else {
> + /* Cache valid fault locally */
> + cache_start = xe_pagefault_start_addr(pf);
> + cache_end = xe_pagefault_end_addr(pf);
> + cache_asid = asid;
> }
>
> - pf.producer.ops->ack_fault(&pf, err);
> +ack_fault:
> + xe_assert(xe, pf->consumer.alloc_state ==
> + XE_PAGEFAULT_ALLOC_STATE_ACTIVE);
> + xe_assert(xe, pf == pf_work->cache.pf);
> +
> + while (pf) {
> + struct xe_pagefault *next;
> +
> + xe_assert(xe, pf->consumer.alloc_state ==
> + XE_PAGEFAULT_ALLOC_STATE_CHAINED ||
> + pf->consumer.alloc_state ==
> + XE_PAGEFAULT_ALLOC_STATE_ACTIVE);
> +
> + pf->producer.ops->ack_fault(pf, err);
> +
> + if (pf->consumer.alloc_state ==
> + XE_PAGEFAULT_ALLOC_STATE_ACTIVE)
> + xe_pagefault_cache_invalidate(pf_queue,
> + pf_work);
This gave me a lot to think about. Why invalidate the cache now?
xe_pagefault_handler tries to chain incoming entries onto the existing chain,
but it has limitations. I think the lock could be grabbed here, or above,
since this whole section would then be protected.
> +
> + /*
> + * Removed from the cache, so next is stable within this
> + * chain. Once alloc_state transitions to
> + * XE_PAGEFAULT_ALLOC_STATE_FREE, the local entry must
> + * not be touched.
> + */
> + next = pf->consumer.next;
> + WRITE_ONCE(pf->consumer.alloc_state,
> + XE_PAGEFAULT_ALLOC_STATE_FREE);
> + pf = next;
> +
> + /*
> + * Requeue chained faults which do not match the last
> + * fault processed
> + */
> + while (pf && !xe_pagefault_cache_match(pf, cache_start,
> + cache_end,
> + cache_asid))
> + pf = xe_pagefault_queue_requeue(pf_queue, pf,
> + gt);
After the 'while', if pf is still set, update pf->consumer.alloc_state to
ACTIVE and update the worker cache to pf, and only then release the lock;
xe_pagefault_queue_requeue would then become a _locked variant. This way the
handler could keep adding new entries while the worker is active, or
queue_pop could later catch such a loose entry and chain it.
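Roughly what I have in mind, only a sketch (the _locked requeue variant is
hypothetical):

	spin_lock_irq(&pf_queue->lock);
	while (pf && !xe_pagefault_cache_match(pf, cache_start,
					       cache_end, cache_asid))
		pf = xe_pagefault_queue_requeue_locked(pf_queue, pf, gt);
	if (pf) {
		/* Keep servicing this chain under the same worker cache */
		pf->consumer.alloc_state = XE_PAGEFAULT_ALLOC_STATE_ACTIVE;
		pf_work->cache.pf = pf;
	}
	spin_unlock_irq(&pf_queue->lock);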
Second thing: I'm not sure I understand this requeue action. Is it, for
instance, because initially all pagefaults in the 2MB range were chained, and
after xe_pagefault_service() the range was narrowed down to 4k? If so, then
some of those pagefaults really should be unchained and requeued.
> + }
>
> if (time_after(jiffies, threshold)) {
> queue_work(xe->usm.pf_wq, w);
> @@ -318,10 +635,8 @@ static void xe_pagefault_queue_work(struct work_struct *w)
> }
> #undef USM_QUEUE_MAX_RUNTIME_MS
>
> - if (gt)
> - xe_gt_stats_incr(xe_root_mmio_gt(gt_to_xe(gt)),
> - XE_GT_STATS_ID_PAGEFAULT_US,
> - xe_gt_stats_ktime_us_delta(start));
> + xe_gt_stats_incr(xe_root_mmio_gt(xe), XE_GT_STATS_ID_PAGEFAULT_US,
> + xe_gt_stats_ktime_us_delta(start));
> }
>
> static int xe_pagefault_queue_init(struct xe_device *xe,
> @@ -408,6 +723,7 @@ int xe_pagefault_init(struct xe_device *xe)
>
> pf_work->xe = xe;
> pf_work->id = i;
> + xe_pagefault_cache_start_invalidate(pf_work->cache.start);
> INIT_WORK(&pf_work->work, xe_pagefault_queue_work);
> }
>
> @@ -430,12 +746,15 @@ static void xe_pagefault_queue_reset(struct xe_device *xe, struct xe_gt *gt,
> /* Squash all pending faults on the GT */
>
> spin_lock_irq(&pf_queue->lock);
> - for (i = pf_queue->tail; i != pf_queue->head;
> - i = (i + xe_pagefault_entry_size()) % pf_queue->size) {
> + for (i = 0; i < pf_queue->size; i += xe_pagefault_entry_size()) {
> struct xe_pagefault *pf = pf_queue->data + i;
>
> - if (pf->gt == gt)
> - pf->gt = NULL;
> + if (pf->gt != gt)
> + continue;
> +
> + pf->consumer.alloc_state =
> + XE_PAGEFAULT_ALLOC_STATE_FREE;
> + pf->consumer.next = NULL;
> }
> spin_unlock_irq(&pf_queue->lock);
> }
> @@ -453,12 +772,11 @@ void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
> xe_pagefault_queue_reset(xe, gt, &xe->usm.pf_queue);
> }
>
> -static bool xe_pagefault_queue_full(struct xe_pagefault_queue *pf_queue)
> +static bool xe_pagefault_queue_empty(struct xe_pagefault_queue *pf_queue)
> {
> lockdep_assert_held(&pf_queue->lock);
>
> - return CIRC_SPACE(pf_queue->head, pf_queue->tail, pf_queue->size) <=
> - xe_pagefault_entry_size();
> + return pf_queue->head == pf_queue->tail;
> }
>
> /*
> @@ -486,18 +804,26 @@ int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
> {
> struct xe_pagefault_queue *pf_queue = &xe->usm.pf_queue;
> unsigned long flags;
> - int work_index;
> bool full;
>
> spin_lock_irqsave(&pf_queue->lock, flags);
> - work_index = xe_pagefault_work_index(xe);
> full = xe_pagefault_queue_full(pf_queue);
> if (!full) {
> - memcpy(pf_queue->data + pf_queue->head, pf, sizeof(*pf));
> - pf_queue->head = (pf_queue->head + xe_pagefault_entry_size()) %
> - pf_queue->size;
> - queue_work(xe->usm.pf_wq,
> - &xe->usm.pf_workers[work_index].work);
> + struct xe_pagefault *lpf;
> + bool empty = xe_pagefault_queue_empty(pf_queue);
> +
> + lpf = __xe_pagefault_queue_add(pf_queue, pf);
> + lpf->consumer.next = NULL;
> +
> + if (xe_pagefault_cache_hit(pf_queue, lpf)) {
> + if (empty)
> + xe_pagefault_queue_advance(pf_queue);
> + } else {
> + int work_index = xe_pagefault_work_index(xe);
> +
> + queue_work(xe->usm.pf_wq,
> + &xe->usm.pf_workers[work_index].work);
> + }
> } else {
> drm_warn(&xe->drm,
> "PageFault Queue full, shouldn't be possible\n");
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.h b/drivers/gpu/drm/xe/xe_pagefault.h
> index bd0cdf9ed37f..feaf2a69674a 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.h
> +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> @@ -6,6 +6,8 @@
> #ifndef _XE_PAGEFAULT_H_
> #define _XE_PAGEFAULT_H_
>
> +#include "xe_pagefault_types.h"
> +
> struct xe_device;
> struct xe_gt;
> struct xe_pagefault;
> @@ -16,4 +18,73 @@ void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
>
> int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf);
>
> +#define XE_PAGEFAULT_END_ADDR_MASK (~0xfffull)
> +
> +/**
> + * xe_pagefault_set_end_addr() - store serviced range end for a pagefault
> + * @pf: Pagefault entry
> + * @end_addr: Inclusive end address of the serviced fault range
> + *
> + * The pagefault consumer stores the resolved fault range so subsequent faults
> + * hitting the same range can be immediately acknowledged without re-running
> + * the full fault handling path.
> + *
> + * The end address shares storage with other consumer metadata and therefore
> + * must be masked with %XE_PAGEFAULT_END_ADDR_MASK before storing. Bits outside
> + * the mask are reserved for internal state tracking and must be preserved.
> + */
> +static inline void
> +xe_pagefault_set_end_addr(struct xe_pagefault *pf, u64 end_addr)
> +{
> + pf->consumer.end_addr &= ~XE_PAGEFAULT_END_ADDR_MASK;
> + pf->consumer.end_addr |= end_addr;
> +}
> +
> +/**
> + * xe_pagefault_end_addr() - read serviced range end for a pagefault
> + * @pf: Pagefault entry
> + *
> + * Returns the inclusive end address of the range previously recorded by
> + * xe_pagefault_set_end_addr(). Only the bits covered by
> + * %XE_PAGEFAULT_END_ADDR_MASK are returned; other bits in the storage are
> + * reserved for internal state.
> + *
> + * Return: End address of the serviced fault range.
> + */
> +static inline u64 xe_pagefault_end_addr(struct xe_pagefault *pf)
> +{
> + return pf->consumer.end_addr & XE_PAGEFAULT_END_ADDR_MASK;
> +}
> +
> +#undef XE_PAGEFAULT_END_ADDR_MASK
> +
> +/**
> + * xe_pagefault_set_start_addr() - store serviced range start for a pagefault
> + * @pf: Pagefault entry
> + * @start_addr: Start address of the serviced fault range
> + *
> + * The pagefault consumer stores the resolved fault range so subsequent faults
> + * hitting the same range can be immediately acknowledged without re-running
> + * the full fault handling path.
> + */
> +static inline void
> +xe_pagefault_set_start_addr(struct xe_pagefault *pf, u64 start_addr)
> +{
> + pf->consumer.page_addr = start_addr;
> +}
> +
> +/**
> + * xe_pagefault_start_addr() - read serviced range start for a pagefault
> + * @pf: Pagefault entry
> + *
> + * Returns the inclusive start address of the range previously recorded by
> + * xe_pagefault_set_start_addr().
> + *
> + * Return: Start address of the serviced fault range.
> + */
> +static inline u64 xe_pagefault_start_addr(struct xe_pagefault *pf)
> +{
> + return pf->consumer.page_addr;
> +}
> +
> #endif
> diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h b/drivers/gpu/drm/xe/xe_pagefault_types.h
> index 75bc53205601..57cb292105d7 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault_types.h
> +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> @@ -60,36 +60,55 @@ struct xe_pagefault {
> /**
> * @consumer: State for the software handling the fault. Populated by
> * the producer and may be modified by the consumer to communicate
> - * information back to the producer upon fault acknowledgment.
> + * information back to the producer upon fault acknowledgment. After
> + * fault acknowledgment, the producer should only access consumer fields
> + * via well defined helpers.
> */
> struct {
> - /** @consumer.page_addr: address of page fault */
> - u64 page_addr;
> - /** @consumer.asid: address space ID */
> - u32 asid;
> /**
> - * @consumer.access_type: access type and prefetch flag packed
> - * into a u8.
> + * @consumer.page_addr: address of page fault, populated by
> + * consumer after fault completion
> */
> - u8 access_type;
> + u64 page_addr;
> + union {
> + struct {
> + /** @alloc_state: page fault allocation state */
> + u8 alloc_state;
> + /**
> + * @consumer.access_type: access type, u8 rather
> + * than enum to keep size compact
> + */
> + u8 access_type;
> #define XE_PAGEFAULT_ACCESS_TYPE_MASK GENMASK(1, 0)
> #define XE_PAGEFAULT_ACCESS_PREFETCH BIT(7)
> - /**
> - * @consumer.fault_type_level: fault type and level, u8 rather
> - * than enum to keep size compact
> - */
> - u8 fault_type_level;
> + /**
> + * @consumer.fault_type_level: fault type and
> + * level, u8 rather than enum to keep size
> + * compact
> + */
> + u8 fault_type_level;
> #define XE_PAGEFAULT_TYPE_LEVEL_NACK 0xff /* Producer indicates nack fault */
> -#define XE_PAGEFAULT_LEVEL_MASK GENMASK(3, 0)
> -#define XE_PAGEFAULT_TYPE_MASK GENMASK(7, 4)
> - /** @consumer.engine_class_instance: engine class and instance */
> - u8 engine_class_instance;
> +#define XE_PAGEFAULT_LEVEL_MASK GENMASK(2, 0)
> +#define XE_PAGEFAULT_TYPE_MASK GENMASK(6, 3)
> +#define XE_PAGEFAULT_REQUEUE_MASK BIT(7)
> + /** @consumer.engine_class_instance: engine class and instance */
> + u8 engine_class_instance;
> #define XE_PAGEFAULT_ENGINE_CLASS_MASK GENMASK(3, 0)
> #define XE_PAGEFAULT_ENGINE_INSTANCE_MASK GENMASK(7, 4)
> - /** @pad: alignment padding */
> - u8 pad;
> - /** consumer.reserved: reserved bits for future expansion */
> - u64 reserved;
> + /** @consumer.asid: address space ID */
> + u32 asid;
> + };
> + /**
> + * @consumer.end_addr: end address of page fault,
> + * populated by consumer after fault completion
> + */
> + u64 end_addr;
> + };
> + /**
> + * @consumer.next: next pagefault chained to this fault,
> + * protected by pf_queue lock
> + */
> + struct xe_pagefault *next;
> } consumer;
> /**
> * @producer: State for the producer (i.e., HW/FW interface). Populated
> @@ -131,7 +150,7 @@ struct xe_pagefault_queue {
> u32 head;
> /** @tail: Tail pointer in bytes, moved by consumer, protected by @lock */
> u32 tail;
> - /** @lock: protects page fault queue */
> + /** @lock: protects page fault queue, workers caches */
> spinlock_t lock;
> };
>
> @@ -146,6 +165,21 @@ struct xe_pagefault_work {
> struct xe_device *xe;
> /** @id: Identifier for this work item */
> int id;
> + /**
> + * @cache: Page fault cache for the currently processed fault
> + *
> + * Protected by the page fault queue lock.
> + */
> + struct {
> + /** @cache.start: Start address of the current page fault */
> + u64 start;
> + /** @cache.end: End address of the current page fault */
> + u64 end;
> + /** @cache.asid: Address space ID of the current page fault */
> + u32 asid;
> + /** @cache.pf: Pointer to the current page fault */
> + struct xe_pagefault *pf;
> + } cache;
> /** @work: Work item used to process the page fault */
> struct work_struct work;
> };
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 66eee490a0c3..fc439fd85187 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -15,6 +15,7 @@
> #include "xe_gt_stats.h"
> #include "xe_migrate.h"
> #include "xe_module.h"
> +#include "xe_pagefault.h"
> #include "xe_pm.h"
> #include "xe_pt.h"
> #include "xe_svm.h"
> @@ -1215,8 +1216,8 @@ DECL_SVM_RANGE_US_STATS(bind, BIND)
> DECL_SVM_RANGE_US_STATS(fault, PAGEFAULT)
>
> static int __xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> - struct xe_gt *gt, u64 fault_addr,
> - bool need_vram)
> + struct xe_pagefault *pf, struct xe_gt *gt,
> + u64 fault_addr, bool need_vram)
> {
> int devmem_possible = IS_DGFX(vm->xe) &&
> IS_ENABLED(CONFIG_DRM_XE_PAGEMAP);
> @@ -1372,6 +1373,10 @@ static int __xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> xe_svm_range_bind_us_stats_incr(gt, range, bind_start);
>
> out:
> + /* Give hint to immediately ack faults */
> + xe_pagefault_set_start_addr(pf, xe_svm_range_start(range));
> + xe_pagefault_set_end_addr(pf, xe_svm_range_end(range));
> +
> xe_svm_range_fault_us_stats_incr(gt, range, start);
> mutex_unlock(&range->lock);
> drm_gpusvm_range_put(&range->base);
> @@ -1394,6 +1399,7 @@ static int __xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> * xe_svm_handle_pagefault() - SVM handle page fault
> * @vm: The VM.
> * @vma: The CPU address mirror VMA.
> + * @pf: Pagefault structure
> * @gt: The gt upon the fault occurred.
> * @fault_addr: The GPU fault address.
> * @atomic: The fault atomic access bit.
> @@ -1404,8 +1410,8 @@ static int __xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> * Return: 0 on success, negative error code on error.
> */
> int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> - struct xe_gt *gt, u64 fault_addr,
> - bool atomic)
> + struct xe_pagefault *pf, struct xe_gt *gt,
> + u64 fault_addr, bool atomic)
> {
> int need_vram, ret;
> retry:
> @@ -1413,7 +1419,7 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> if (need_vram < 0)
> return need_vram;
>
> - ret = __xe_svm_handle_pagefault(vm, vma, gt, fault_addr,
> + ret = __xe_svm_handle_pagefault(vm, vma, pf, gt, fault_addr,
> need_vram ? true : false);
> if (ret == -EAGAIN) {
> /*
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index ebcca34f7f4d..07be92579971 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -21,6 +21,7 @@ struct drm_file;
> struct xe_bo;
> struct xe_gt;
> struct xe_device;
> +struct xe_pagefault;
> struct xe_vram_region;
> struct xe_tile;
> struct xe_vm;
> @@ -107,8 +108,8 @@ void xe_svm_fini(struct xe_vm *vm);
> void xe_svm_close(struct xe_vm *vm);
>
> int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> - struct xe_gt *gt, u64 fault_addr,
> - bool atomic);
> + struct xe_pagefault *pf, struct xe_gt *gt,
> + u64 fault_addr, bool atomic);
>
> bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
>
> @@ -296,8 +297,8 @@ void xe_svm_close(struct xe_vm *vm)
>
> static inline
> int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> - struct xe_gt *gt, u64 fault_addr,
> - bool atomic)
> + struct xe_pagefault *pf, struct xe_gt *gt,
> + u64 fault_addr, bool atomic)
> {
> return 0;
> }
This is huge and impressive work, the patch and the whole series! I can't say
I'm done with the review; I will continue it, as I need to understand it for
Intel Xe GPU Debug Support (eudebug)
(https://patchwork.freedesktop.org/series/165777/).

Regards,
Maciej