Intel-XE Archive on lore.kernel.org
From: Maciej Patelczyk <maciej.patelczyk@intel.com>
To: Matthew Brost <matthew.brost@intel.com>,
	<intel-xe@lists.freedesktop.org>
Cc: <stuart.summers@intel.com>, <arvind.yadav@intel.com>,
	<himal.prasad.ghimiray@intel.com>,
	<thomas.hellstrom@linux.intel.com>, <francois.dugast@intel.com>
Subject: Re: [PATCH v4 08/12] drm/xe: Chain page faults via queue-resident cache to avoid fault storms
Date: Fri, 8 May 2026 14:03:49 +0200	[thread overview]
Message-ID: <08310532-c78a-4767-aea2-abc86877a156@intel.com> (raw)
In-Reply-To: <20260226042834.2963245-9-matthew.brost@intel.com>

On 26/02/2026 05:28, Matthew Brost wrote:

> Some Xe platforms can generate pagefault storms where many faults target
> the same address range in a short time window (e.g. many EU threads
> faulting the same page). The current worker/locking model effectively
> serializes faults for a given range and repeatedly performs VMA/range
> lookups for each fault, which creates head-of-queue blocking and wastes
> CPU in the hot path.
>
> Introduce a page fault chaining cache that coalesces faults targeting
> the same ASID and address range.
>
> Each worker tracks the active fault range it is servicing. Fault entries
> reside in stable queue storage, allowing the IRQ handler to match new
> faults against the worker cache and directly chain cache hits onto the
> active entry without allocation or waiting for dequeue. Once the leading
> fault completes, the worker acknowledges the entire chain.
>
> A small allocation state is added to each entry so queue, worker, and
> IRQ paths can safely reference the same fault object. This prevents
> reuse while the fault is active and guarantees that chained faults
> remain valid until acknowledged.
>
> Fault handlers also record the serviced range so subsequent faults can
> be acknowledged without re-running the full resolution path.
>
> This removes repeated fault resolution during fault storms and
> significantly improves forward progress in SVM workloads.
>
> Assisted-by: Chat-GPT # Documentation
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/xe/xe_pagefault.c       | 434 +++++++++++++++++++++---
>   drivers/gpu/drm/xe/xe_pagefault.h       |  71 ++++
>   drivers/gpu/drm/xe/xe_pagefault_types.h |  78 +++--
>   drivers/gpu/drm/xe/xe_svm.c             |  16 +-
>   drivers/gpu/drm/xe/xe_svm.h             |   9 +-
>   5 files changed, 523 insertions(+), 85 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
> index 030452923ab9..9c14f9505faf 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.c
> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
> @@ -35,6 +35,70 @@
>    * xe_pagefault.c implements the consumer layer.
>    */
>   
> +/**
> + * DOC: Xe page fault cache
> + *
> + * Some Xe hardware can trigger “fault storms,” which are many page faults to
> + * the same address within a short period of time. An example is many EU threads
> + * faulting on the same page simultaneously. With the current page fault locking
> + * structure, only one page fault for a given address range can be processed at
> + * a time. This causes head-of-queue blocking across workers, killing
> + * parallelism. If the page fault handler must repeatedly look up resources
> + * (VMAs, ranges) to determine that the pages are valid for each fault in the
> + * storm, the time complexity grows rapidly.
> + *
> + * To address this, each page fault worker maintains a cache of the active fault
> + * being processed. Subsequent faults that hit in the cache are chained to the
> + * pending fault, and all chained faults are acknowledged once the initial fault
> + * completes. This alleviates head-of-queue blocking and quickly chains faults
> + * in the upper layers, avoiding expensive lookups in the main fault-handling
> + * path.
> + *
> + * Faults are buffered in the page fault queue in a way that provides stable
> + * storage for outstanding faults. In particular, faults may be chained directly
> + * while still resident in the queue storage (i.e., outside the worker’s current
> + * head/tail dequeue position). This allows the IRQ handler to match newly
> + * arrived faults against the per-worker cache and immediately chain cache hits
> + * onto the active fault under the queue lock, without allocating memory or
> + * waiting for the worker to pop the fault first.
> + *
> + * A per-fault state field is used to assert correctness of these invariants.
> + * The state tracks whether an entry is free, queued, chained, or currently
> + * active. Transitions are performed under the page fault queue lock, and the
> + * worker acknowledges faults by walking the chain and returning entries to the
> + * free state once they are complete.
> + */
> +
> +/**
> + * enum xe_pagefault_alloc_state - lifetime state for a page fault queue entry
> + * @XE_PAGEFAULT_ALLOC_STATE_FREE:
> + *	Entry is unused and may be overwritten by the producer, consumer retry
> + *	or requeue.
> + * @XE_PAGEFAULT_ALLOC_STATE_QUEUED:
> + *	Entry has been enqueued and may be dequeued by a worker.
> + * @XE_PAGEFAULT_ALLOC_STATE_ACTIVE:
> + *	Entry has been dequeued and is the worker's currently serviced fault.
> + *	The worker may attach additional faults to it via consumer.next.
> + * @XE_PAGEFAULT_ALLOC_STATE_CHAINED:
> + *	Entry is not independently serviced; it has been chained onto an
> + *	ACTIVE entry via consumer.next and will be acknowledged when the
> + *	leading fault completes.
> + *
> + * The page fault queue provides stable storage for outstanding faults so the
> + * IRQ handler can chain new cache hits directly onto a worker's active fault.
> + * Because entries may remain referenced outside the consumer dequeue window,
> + * the producer must only write into entries in the FREE state.
> + *
> + * State transitions are protected by the page fault queue lock. Workers return
> + * entries to FREE after acknowledging the fault (either as ACTIVE or CHAINED).
> + */
> +enum xe_pagefault_alloc_state {
> +	XE_PAGEFAULT_ALLOC_STATE_FREE		= 0,
> +	XE_PAGEFAULT_ALLOC_STATE_QUEUED		= 1,
> +	XE_PAGEFAULT_ALLOC_STATE_CHAINED	= 2,
> +	XE_PAGEFAULT_ALLOC_STATE_ACTIVE		= 3,
> +};
> +
>   static int xe_pagefault_entry_size(void)
>   {
>   	/*
> @@ -64,7 +128,7 @@ static int xe_pagefault_begin(struct drm_exec *exec, struct xe_vma *vma,
>   }
>   
>   static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
> -				   bool atomic)
> +				   struct xe_pagefault *pf, bool atomic)
>   {
>   	struct xe_vm *vm = xe_vma_vm(vma);
>   	struct xe_tile *tile = gt_to_tile(gt);
> @@ -89,8 +153,11 @@ static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
>   
>   	/* Check if VMA is valid, opportunistic check only */
>   	if (xe_vm_has_valid_gpu_mapping(tile, vma->tile_present,
> -					vma->tile_invalidated) && !atomic)
> +					vma->tile_invalidated) && !atomic) {
> +		xe_pagefault_set_start_addr(pf, xe_vma_start(vma));
> +		xe_pagefault_set_end_addr(pf, xe_vma_end(vma));
>   		return 0;
> +	}
>   
>   	do {
>   		if (xe_vma_is_userptr(vma) &&
> @@ -128,6 +195,10 @@ static int xe_pagefault_handle_vma(struct xe_gt *gt, struct xe_vma *vma,
>   	} while (err == -EAGAIN);
>   
>   	if (!err) {
> +		/* Give hint to immediately ack faults */
> +		xe_pagefault_set_start_addr(pf, xe_vma_start(vma));
> +		xe_pagefault_set_end_addr(pf, xe_vma_end(vma));
> +
>   		dma_fence_wait(fence, false);
>   		dma_fence_put(fence);
>   	}
> @@ -199,10 +270,10 @@ static int xe_pagefault_service(struct xe_pagefault *pf)
>   	atomic = xe_pagefault_access_is_atomic(pf->consumer.access_type);
>   
>   	if (xe_vma_is_cpu_addr_mirror(vma))
> -		err = xe_svm_handle_pagefault(vm, vma, gt,
> +		err = xe_svm_handle_pagefault(vm, vma, pf, gt,
>   					      pf->consumer.page_addr, atomic);
>   	else
> -		err = xe_pagefault_handle_vma(gt, vma, atomic);
> +		err = xe_pagefault_handle_vma(gt, vma, pf, atomic);
>   
>   unlock_vm:
>   	if (!err)
> @@ -214,33 +285,228 @@ static int xe_pagefault_service(struct xe_pagefault *pf)
>   	return err;
>   }
>   
> +#define XE_PAGEFAULT_CACHE_START_INVALID	U64_MAX
> +#define xe_pagefault_cache_start_invalidate(val)	\
> +	(val = XE_PAGEFAULT_CACHE_START_INVALID)
> +
> +static void
> +__xe_pagefault_cache_invalidate(struct xe_pagefault_queue *pf_queue,
> +				struct xe_pagefault_work *pf_work)
> +{
> +	lockdep_assert_held(&pf_queue->lock);
> +
> +	xe_pagefault_cache_start_invalidate(pf_work->cache.start);
> +}
> +
> +static void
> +xe_pagefault_cache_invalidate(struct xe_pagefault_queue *pf_queue,
> +			      struct xe_pagefault_work *pf_work)
> +{
> +	spin_lock_irq(&pf_queue->lock);
> +	__xe_pagefault_cache_invalidate(pf_queue, pf_work);
> +	spin_unlock_irq(&pf_queue->lock);
> +}
> +
> +static bool xe_pagefault_queue_full(struct xe_pagefault_queue *pf_queue)
> +{
> +	lockdep_assert_held(&pf_queue->lock);
> +
> +	return CIRC_SPACE(pf_queue->head, pf_queue->tail,
> +			  pf_queue->size) <= xe_pagefault_entry_size();
> +}
> +
> +static struct xe_pagefault *
> +__xe_pagefault_queue_add(struct xe_pagefault_queue *pf_queue,
> +			 struct xe_pagefault *pf)
> +{
> +	struct xe_device *xe = container_of(pf_queue, typeof(*xe),
> +					    usm.pf_queue);
> +	struct xe_pagefault *lpf;
> +
> +	lockdep_assert_held(&pf_queue->lock);
> +
> +	do {
> +		xe_assert(xe, !xe_pagefault_queue_full(pf_queue));
> +
> +		lpf = (pf_queue->data + pf_queue->head);
> +		pf_queue->head = (pf_queue->head + xe_pagefault_entry_size()) %
> +			pf_queue->size;
> +	} while (lpf->consumer.alloc_state != XE_PAGEFAULT_ALLOC_STATE_FREE);
> +
> +	memcpy(lpf, pf, sizeof(*pf));
> +	lpf->consumer.alloc_state = XE_PAGEFAULT_ALLOC_STATE_QUEUED;
> +
> +	return lpf;
> +}
> +
>   static void xe_pagefault_queue_retry(struct xe_pagefault_queue *pf_queue,
> -				     struct xe_pagefault *pf)
> +				     struct xe_pagefault *pf,
> +				     struct xe_pagefault_work *pf_work)
>   {
> +	struct xe_device *xe = container_of(pf_queue, typeof(*xe),
> +					    usm.pf_queue);
> +
> +	xe_assert(xe, pf->consumer.alloc_state ==
> +		  XE_PAGEFAULT_ALLOC_STATE_ACTIVE);
> +
>   	spin_lock_irq(&pf_queue->lock);
> -	if (!pf_queue->tail)
> -		pf_queue->tail = pf_queue->size - xe_pagefault_entry_size();
> -	else
> -		pf_queue->tail -= xe_pagefault_entry_size();
> -	memcpy(pf_queue->data + pf_queue->tail, pf, sizeof(*pf));
> +	pf->consumer.alloc_state = XE_PAGEFAULT_ALLOC_STATE_FREE;
> +	__xe_pagefault_queue_add(pf_queue, pf);
> +	__xe_pagefault_cache_invalidate(pf_queue, pf_work);
>   	spin_unlock_irq(&pf_queue->lock);
>   }
>   
> -static bool xe_pagefault_queue_pop(struct xe_pagefault_queue *pf_queue,
> -				   struct xe_pagefault *pf)
> +static struct xe_pagefault *
> +xe_pagefault_queue_requeue(struct xe_pagefault_queue *pf_queue,
> +			   struct xe_pagefault *pf, struct xe_gt *gt)
This is specifically for chained entries, and what it actually does is unchain and requeue, so something like _unchain_and_requeue would be a more accurate name.
>   {
> -	bool found_fault = false;
> +	struct xe_device *xe = container_of(pf_queue, typeof(*xe),
> +					    usm.pf_queue);
> +	struct xe_pagefault *next = pf->consumer.next, *lpf;
> +
> +	xe_assert(xe, pf->consumer.alloc_state ==
> +		  XE_PAGEFAULT_ALLOC_STATE_CHAINED);
>   
>   	spin_lock_irq(&pf_queue->lock);
> -	if (pf_queue->tail != pf_queue->head) {
> -		memcpy(pf, pf_queue->data + pf_queue->tail, sizeof(*pf));
> -		pf_queue->tail = (pf_queue->tail + xe_pagefault_entry_size()) %
> -			pf_queue->size;
> -		found_fault = true;
> -	}
> +	pf->consumer.alloc_state = XE_PAGEFAULT_ALLOC_STATE_FREE;
> +	lpf = __xe_pagefault_queue_add(pf_queue, pf);
> +	lpf->consumer.next = NULL;
> +	lpf->consumer.fault_type_level |= XE_PAGEFAULT_REQUEUE_MASK;
>   	spin_unlock_irq(&pf_queue->lock);
>   
> -	return found_fault;
> +	return next;
> +}
> +
> +static bool xe_pagefault_cache_match(struct xe_pagefault *pf, u64 start,
> +				     u64 end, u64 cache_asid)
Maybe drop 'cache' from the parameter name, since the comparison is against the function's arguments, not necessarily the worker cache.
> +{
> +	struct xe_device *xe = gt_to_xe(pf->gt);
> +	u64 page_addr = pf->consumer.page_addr;
> +	u32 pf_asid = pf->consumer.asid;
> +
> +	xe_assert(xe, pf->consumer.alloc_state !=
> +		  XE_PAGEFAULT_ALLOC_STATE_FREE);
> +
> +	return page_addr >= start && page_addr < end &&
> +		pf_asid == cache_asid;
> +}
> +
> +static bool xe_pagefault_cache_hit(struct xe_pagefault_queue *pf_queue,
> +				   struct xe_pagefault *pf)

I would say that this function chains the entry onto a worker's active chain.

I think you could rename it to xe_pagefault_entry_chained() or similar.

From 'cache_hit' I would expect just an answer as to whether the entry matches any cache.

> +{
> +	struct xe_device *xe = container_of(pf_queue, typeof(*xe),
> +					    usm.pf_queue);
> +	struct xe_pagefault_work *pf_work;
> +	bool requeue = FIELD_GET(XE_PAGEFAULT_REQUEUE_MASK,
> +				 pf->consumer.fault_type_level);
> +	int i;
> +
> +	lockdep_assert_held(&pf_queue->lock);
> +	xe_assert(xe, pf->consumer.alloc_state ==
> +		  XE_PAGEFAULT_ALLOC_STATE_QUEUED);
> +
> +	/*
> +	 * If this is a retry, we may already have a chain attached. In that
> +	 * case, we cannot hit in the cache because chains cannot easily be
> +	 * combined.
> +	 */
> +	if (pf->consumer.next)
> +		return false;
> +
> +	for (i = 0, pf_work = xe->usm.pf_workers;
> +	     i < xe->info.num_pf_work; ++i, ++pf_work) {
> +		u64 start = pf_work->cache.start;
> +		u64 end = requeue ? start + SZ_4K : pf_work->cache.end;
> +		u32 asid = pf_work->cache.asid;
> +
> +		if (xe_pagefault_cache_match(pf, start, end, asid)) {
> +			xe_assert(xe, pf_work->cache.pf->consumer.alloc_state ==
> +				  XE_PAGEFAULT_ALLOC_STATE_ACTIVE);
> +
> +			pf->consumer.alloc_state =
> +				XE_PAGEFAULT_ALLOC_STATE_CHAINED;
> +			pf->consumer.next = pf_work->cache.pf->consumer.next;
> +			pf_work->cache.pf->consumer.next = pf;
> +
> +			return true;
> +		}
> +	}
> +
> +	return false;
> +}
> +
> +static void xe_pagefault_queue_advance(struct xe_pagefault_queue *pf_queue)
> +{
> +	lockdep_assert_held(&pf_queue->lock);
> +
> +	pf_queue->tail = (pf_queue->tail + xe_pagefault_entry_size()) %
> +		pf_queue->size;
> +}
> +
> +static bool xe_pagefault_queue_pop(struct xe_pagefault_queue *pf_queue,
> +				   struct xe_pagefault **pf, int id)
> +{
> +	struct xe_device *xe = container_of(pf_queue, typeof(*xe),
> +					    usm.pf_queue);
> +	struct xe_pagefault_work *pf_work;
> +	struct xe_pagefault *lpf;
> +	size_t align = SZ_2M;
> +
> +	guard(spinlock_irq)(&pf_queue->lock);
> +
> +	for (*pf = NULL; !*pf;) {
> +		if (pf_queue->tail == pf_queue->head)
> +			return false;
xe_pagefault_queue_empty(pf_queue)?
> +
> +		lpf = (pf_queue->data + pf_queue->tail);

Maybe add a helper

static inline struct xe_pagefault *xe_pagefault_queue_get_tail_locked(struct xe_pagefault_queue *pf_queue)
{
	return pf_queue->data + pf_queue->tail;
}

Getting the tail is used in more than one place in the code.

> +		xe_pagefault_queue_advance(pf_queue);
> +
> +		if (lpf->consumer.alloc_state !=
> +		    XE_PAGEFAULT_ALLOC_STATE_QUEUED)
> +			continue;
> +
> +		if (xe_pagefault_cache_hit(pf_queue, lpf))
> +			continue;

This chains a STATE_QUEUED entry onto the matching worker's active chain.

> +
> +		*pf = lpf;	/* Hand back page fault for processing */
> +	}
> +
> +	/*
> +	 * No cache hit; allocate a new cache entry. We assume most faults
> +	 * within a 2M range will hit the same pages. If this assumption proves
> +	 * false, the mismatched fault is requeued after the initial fault is
> +	 * acknowledged.
> +	 */
> +	pf_work = xe->usm.pf_workers + id;
> +	if (FIELD_GET(XE_PAGEFAULT_REQUEUE_MASK,
> +		      lpf->consumer.fault_type_level))
> +		align = SZ_4K;
> +	pf_work->cache.start = ALIGN_DOWN(lpf->consumer.page_addr, align);
> +	pf_work->cache.end = pf_work->cache.start + align;
> +	pf_work->cache.asid = lpf->consumer.asid;
> +	pf_work->cache.pf = lpf;
> +	lpf->consumer.alloc_state = XE_PAGEFAULT_ALLOC_STATE_ACTIVE;
> +
> +	/* Drain queue until empty or new fault found */
> +	while (1) {
> +		if (pf_queue->tail == pf_queue->head)
> +			break;
> +
> +		lpf = (pf_queue->data + pf_queue->tail);
> +
> +		if (lpf->consumer.alloc_state !=
> +		    XE_PAGEFAULT_ALLOC_STATE_QUEUED) {
> +			xe_pagefault_queue_advance(pf_queue);
> +			continue;
> +		}
> +
> +		if (!xe_pagefault_cache_hit(pf_queue, lpf))
> +			break;
> +
> +		xe_pagefault_queue_advance(pf_queue);
> +	}
> +
This while (1) loop is where the initial chaining begins: pagefaults that were queued before any worker started are chained here for the first time.
> +	return true;
>   }
>   
>   static void xe_pagefault_print(struct xe_pagefault *pf)
> @@ -276,40 +542,91 @@ static void xe_pagefault_queue_work(struct work_struct *w)
>   		container_of(w, typeof(*pf_work), work);
>   	struct xe_device *xe = pf_work->xe;
>   	struct xe_pagefault_queue *pf_queue = &xe->usm.pf_queue;
> -	struct xe_pagefault pf;
> +	struct xe_pagefault *pf;
>   	ktime_t start = xe_gt_stats_ktime_get();
> -	struct xe_gt *gt = NULL;
>   	unsigned long threshold;
> +	u64 cache_start = XE_PAGEFAULT_CACHE_START_INVALID, cache_end = 0;
> +	u32 cache_asid = 0;
>   
>   #define USM_QUEUE_MAX_RUNTIME_MS      20
>   	threshold = jiffies + msecs_to_jiffies(USM_QUEUE_MAX_RUNTIME_MS);
>   
> -	while (xe_pagefault_queue_pop(pf_queue, &pf)) {
> -		int err;
>   
> -		if (!pf.gt)	/* Fault squashed during reset */
> -			continue;
> +	while (xe_pagefault_queue_pop(pf_queue, &pf, pf_work->id)) {
> +		struct xe_gt *gt = pf->gt;
> +		u32 asid = pf->consumer.asid;
> +		int err = 0;
> +
> +		/* Last fault same address, ack immediately */
> +		if (xe_pagefault_cache_match(pf, cache_start, cache_end,
> +					     cache_asid))
> +			goto ack_fault;
>   
> -		gt = pf.gt;
> -		err = xe_pagefault_service(&pf);
> +		err = xe_pagefault_service(pf);
>   
>   		if (err == -EAGAIN) {
> -			xe_pagefault_queue_retry(pf_queue, &pf);
> +			xe_pagefault_queue_retry(pf_queue, pf, pf_work);
>   			queue_work(xe->usm.pf_wq, w);
>   			break;
> -		} else if (err) {
> -			if (!(pf.consumer.access_type & XE_PAGEFAULT_ACCESS_PREFETCH)) {
> -				xe_pagefault_print(&pf);
> -				xe_gt_info(pf.gt, "Fault response: Unsuccessful %pe\n",
> +		} else if (err ) {
Checkpatch will catch it anyway: there is a stray space after 'err'.
> +			if (!(pf->consumer.access_type & XE_PAGEFAULT_ACCESS_PREFETCH)) {
> +				xe_pagefault_cache_start_invalidate(cache_start);
> +				xe_pagefault_print(pf);
> +				xe_gt_info(pf->gt, "Fault response: Unsuccessful %pe\n",
>   					   ERR_PTR(err));
>   			} else {
> -				xe_gt_stats_incr(pf.gt, XE_GT_STATS_ID_INVALID_PREFETCH_PAGEFAULT_COUNT, 1);
> -				xe_gt_dbg(pf.gt, "Prefetch Fault response: Unsuccessful %pe\n",
> +				xe_gt_stats_incr(pf->gt, XE_GT_STATS_ID_INVALID_PREFETCH_PAGEFAULT_COUNT, 1);
> +				xe_gt_dbg(pf->gt, "Prefetch Fault response: Unsuccessful %pe\n",
>   					  ERR_PTR(err));
>   			}
> +		} else {
> +			/* Cache valid fault locally */
> +			cache_start = xe_pagefault_start_addr(pf);
> +			cache_end = xe_pagefault_end_addr(pf);
> +			cache_asid = asid;
>   		}
>   
> -		pf.producer.ops->ack_fault(&pf, err);
> +ack_fault:
> +		xe_assert(xe, pf->consumer.alloc_state ==
> +			  XE_PAGEFAULT_ALLOC_STATE_ACTIVE);
> +		xe_assert(xe, pf == pf_work->cache.pf);
> +
> +		while (pf) {
> +			struct xe_pagefault *next;
> +
> +			xe_assert(xe, pf->consumer.alloc_state ==
> +				  XE_PAGEFAULT_ALLOC_STATE_CHAINED ||
> +				  pf->consumer.alloc_state ==
> +				  XE_PAGEFAULT_ALLOC_STATE_ACTIVE);
> +
> +			pf->producer.ops->ack_fault(pf, err);
> +
> +			if (pf->consumer.alloc_state ==
> +			    XE_PAGEFAULT_ALLOC_STATE_ACTIVE)
> +				xe_pagefault_cache_invalidate(pf_queue,
> +							      pf_work);

This gives me a lot to think about.

Why invalidate the cache at this point? xe_pagefault_handler() tries to chain incoming entries onto the existing chain, but that has limitations.

I think the lock could be grabbed here, or above, since the whole section below effectively becomes protected anyway.

> +
> +			/*
> +			 * Removed from the cache, so next is stable within this
> +			 * chain. Once alloc_state transitions to
> +			 * XE_PAGEFAULT_ALLOC_STATE_FREE, the local entry must
> +			 * not be touched.
> +			 */
> +			next = pf->consumer.next;
> +			WRITE_ONCE(pf->consumer.alloc_state,
> +				   XE_PAGEFAULT_ALLOC_STATE_FREE);
> +			pf = next;
> +
> +			/*
> +			 * Requeue chained faults which do not match the last
> +			 * fault processed
> +			 */
> +			while (pf && !xe_pagefault_cache_match(pf, cache_start,
> +							       cache_end,
> +							       cache_asid))
> +				pf = xe_pagefault_queue_requeue(pf_queue, pf,
> +								gt);

After the 'while', if pf is non-NULL, you could set pf->consumer.alloc_state to ACTIVE and point the worker cache at pf, then release the lock; xe_pagefault_queue_requeue() would become a _locked variant. This way the handler could keep adding new entries to the chain while the worker is active.

Alternatively, queue_pop could later catch such a loose entry and chain it...

Second thing: I'm not sure I understand this requeue action. Is it, for instance, because initially all pagefaults in the 2M range were chained, and after xe_pagefault_service() the range was narrowed down to 4K? If so, then some of those pagefaults should indeed be unchained and requeued.
> +		}
>   
>   		if (time_after(jiffies, threshold)) {
>   			queue_work(xe->usm.pf_wq, w);
> @@ -318,10 +635,8 @@ static void xe_pagefault_queue_work(struct work_struct *w)
>   	}
>   #undef USM_QUEUE_MAX_RUNTIME_MS
>   
> -	if (gt)
> -		xe_gt_stats_incr(xe_root_mmio_gt(gt_to_xe(gt)),
> -				 XE_GT_STATS_ID_PAGEFAULT_US,
> -				 xe_gt_stats_ktime_us_delta(start));
> +	xe_gt_stats_incr(xe_root_mmio_gt(xe), XE_GT_STATS_ID_PAGEFAULT_US,
> +			 xe_gt_stats_ktime_us_delta(start));
>   }
>   
>   static int xe_pagefault_queue_init(struct xe_device *xe,
> @@ -408,6 +723,7 @@ int xe_pagefault_init(struct xe_device *xe)
>   
>   		pf_work->xe = xe;
>   		pf_work->id = i;
> +		xe_pagefault_cache_start_invalidate(pf_work->cache.start);
>   		INIT_WORK(&pf_work->work, xe_pagefault_queue_work);
>   	}
>   
> @@ -430,12 +746,15 @@ static void xe_pagefault_queue_reset(struct xe_device *xe, struct xe_gt *gt,
>   	/* Squash all pending faults on the GT */
>   
>   	spin_lock_irq(&pf_queue->lock);
> -	for (i = pf_queue->tail; i != pf_queue->head;
> -	     i = (i + xe_pagefault_entry_size()) % pf_queue->size) {
> +	for (i = 0; i < pf_queue->size; i += xe_pagefault_entry_size()) {
>   		struct xe_pagefault *pf = pf_queue->data + i;
>   
> -		if (pf->gt == gt)
> -			pf->gt = NULL;
> +		if (pf->gt != gt)
> +			continue;
> +
> +		pf->consumer.alloc_state =
> +			XE_PAGEFAULT_ALLOC_STATE_FREE;
> +		pf->consumer.next = NULL;
>   	}
>   	spin_unlock_irq(&pf_queue->lock);
>   }
> @@ -453,12 +772,11 @@ void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt)
>   	xe_pagefault_queue_reset(xe, gt, &xe->usm.pf_queue);
>   }
>   
> -static bool xe_pagefault_queue_full(struct xe_pagefault_queue *pf_queue)
> +static bool xe_pagefault_queue_empty(struct xe_pagefault_queue *pf_queue)
>   {
>   	lockdep_assert_held(&pf_queue->lock);
>   
> -	return CIRC_SPACE(pf_queue->head, pf_queue->tail, pf_queue->size) <=
> -		xe_pagefault_entry_size();
> +	return pf_queue->head == pf_queue->tail;
>   }
>   
>   /*
> @@ -486,18 +804,26 @@ int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf)
>   {
>   	struct xe_pagefault_queue *pf_queue = &xe->usm.pf_queue;
>   	unsigned long flags;
> -	int work_index;
>   	bool full;
>   
>   	spin_lock_irqsave(&pf_queue->lock, flags);
> -	work_index = xe_pagefault_work_index(xe);
>   	full = xe_pagefault_queue_full(pf_queue);
>   	if (!full) {
> -		memcpy(pf_queue->data + pf_queue->head, pf, sizeof(*pf));
> -		pf_queue->head = (pf_queue->head + xe_pagefault_entry_size()) %
> -			pf_queue->size;
> -		queue_work(xe->usm.pf_wq,
> -			   &xe->usm.pf_workers[work_index].work);
> +		struct xe_pagefault *lpf;
> +		bool empty = xe_pagefault_queue_empty(pf_queue);
> +
> +		lpf = __xe_pagefault_queue_add(pf_queue, pf);
> +		lpf->consumer.next = NULL;
> +
> +		if (xe_pagefault_cache_hit(pf_queue, lpf)) {
> +			if (empty)
> +				xe_pagefault_queue_advance(pf_queue);
> +		} else {
> +			int work_index = xe_pagefault_work_index(xe);
> +
> +			queue_work(xe->usm.pf_wq,
> +				   &xe->usm.pf_workers[work_index].work);
> +		}
>   	} else {
>   		drm_warn(&xe->drm,
>   			 "PageFault Queue full, shouldn't be possible\n");
> diff --git a/drivers/gpu/drm/xe/xe_pagefault.h b/drivers/gpu/drm/xe/xe_pagefault.h
> index bd0cdf9ed37f..feaf2a69674a 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault.h
> +++ b/drivers/gpu/drm/xe/xe_pagefault.h
> @@ -6,6 +6,8 @@
>   #ifndef _XE_PAGEFAULT_H_
>   #define _XE_PAGEFAULT_H_
>   
> +#include "xe_pagefault_types.h"
> +
>   struct xe_device;
>   struct xe_gt;
>   struct xe_pagefault;
> @@ -16,4 +18,73 @@ void xe_pagefault_reset(struct xe_device *xe, struct xe_gt *gt);
>   
>   int xe_pagefault_handler(struct xe_device *xe, struct xe_pagefault *pf);
>   
> +#define XE_PAGEFAULT_END_ADDR_MASK	(~0xfffull)
> +
> +/**
> + * xe_pagefault_set_end_addr() - store serviced range end for a pagefault
> + * @pf: Pagefault entry
> + * @end_addr: Inclusive end address of the serviced fault range
> + *
> + * The pagefault consumer stores the resolved fault range so subsequent faults
> + * hitting the same range can be immediately acknowledged without re-running
> + * the full fault handling path.
> + *
> + * The end address shares storage with other consumer metadata and therefore
> + * must be masked with %XE_PAGEFAULT_END_ADDR_MASK before storing. Bits outside
> + * the mask are reserved for internal state tracking and must be preserved.
> + */
> +static inline void
> +xe_pagefault_set_end_addr(struct xe_pagefault *pf, u64 end_addr)
> +{
> +	pf->consumer.end_addr &= ~XE_PAGEFAULT_END_ADDR_MASK;
> +	pf->consumer.end_addr |= end_addr;
> +}
> +
> +/**
> + * xe_pagefault_end_addr() - read serviced range end for a pagefault
> + * @pf: Pagefault entry
> + *
> + * Returns the inclusive end address of the range previously recorded by
> + * xe_pagefault_set_end_addr(). Only the bits covered by
> + * %XE_PAGEFAULT_END_ADDR_MASK are returned; other bits in the storage are
> + * reserved for internal state.
> + *
> + * Return: End address of the serviced fault range.
> + */
> +static inline u64 xe_pagefault_end_addr(struct xe_pagefault *pf)
> +{
> +	return pf->consumer.end_addr & XE_PAGEFAULT_END_ADDR_MASK;
> +}
> +
> +#undef XE_PAGEFAULT_END_ADDR_MASK
> +
> +/**
> + * xe_pagefault_set_start_addr() - store serviced range start for a pagefault
> + * @pf: Pagefault entry
> + * @start_addr: Start address of the serviced fault range
> + *
> + * The pagefault consumer stores the resolved fault range so subsequent faults
> + * hitting the same range can be immediately acknowledged without re-running
> + * the full fault handling path.
> + */
> +static inline void
> +xe_pagefault_set_start_addr(struct xe_pagefault *pf, u64 start_addr)
> +{
> +	pf->consumer.page_addr = start_addr;
> +}
> +
> +/**
> + * xe_pagefault_start_addr() - read serviced range start for a pagefault
> + * @pf: Pagefault entry
> + *
> + * Returns the inclusive start address of the range previously recorded by
> + * xe_pagefault_set_start_addr().
> + *
> + * Return: Start address of the serviced fault range.
> + */
> +static inline u64 xe_pagefault_start_addr(struct xe_pagefault *pf)
> +{
> +	return pf->consumer.page_addr;
> +}
> +
>   #endif
> diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h b/drivers/gpu/drm/xe/xe_pagefault_types.h
> index 75bc53205601..57cb292105d7 100644
> --- a/drivers/gpu/drm/xe/xe_pagefault_types.h
> +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
> @@ -60,36 +60,55 @@ struct xe_pagefault {
>   	/**
>   	 * @consumer: State for the software handling the fault. Populated by
>   	 * the producer and may be modified by the consumer to communicate
> -	 * information back to the producer upon fault acknowledgment.
> +	 * information back to the producer upon fault acknowledgment. After
> +	 * fault acknowledgment, the producer should only access consumer fields
> +	 * via well defined helpers.
>   	 */
>   	struct {
> -		/** @consumer.page_addr: address of page fault */
> -		u64 page_addr;
> -		/** @consumer.asid: address space ID */
> -		u32 asid;
>   		/**
> -		 * @consumer.access_type: access type and prefetch flag packed
> -		 * into a u8.
> +		 * @consumer.page_addr: address of page fault, populated by
> +		 * consumer after fault completion
>   		 */
> -		u8 access_type;
> +		u64 page_addr;
> +		union {
> +			struct {
> +				/** @alloc_state: page fault allocation state */
> +				u8 alloc_state;
> +				/**
> +				 * @consumer.access_type: access type, u8 rather
> +				 * than enum to keep size compact
> +				 */
> +				u8 access_type;
>   #define XE_PAGEFAULT_ACCESS_TYPE_MASK	GENMASK(1, 0)
>   #define XE_PAGEFAULT_ACCESS_PREFETCH	BIT(7)
> -		/**
> -		 * @consumer.fault_type_level: fault type and level, u8 rather
> -		 * than enum to keep size compact
> -		 */
> -		u8 fault_type_level;
> +				/**
> +				 * @consumer.fault_type_level: fault type and
> +				 * level, u8 rather than enum to keep size
> +				 * compact
> +				 */
> +				u8 fault_type_level;
>   #define XE_PAGEFAULT_TYPE_LEVEL_NACK		0xff	/* Producer indicates nack fault */
> -#define XE_PAGEFAULT_LEVEL_MASK			GENMASK(3, 0)
> -#define XE_PAGEFAULT_TYPE_MASK			GENMASK(7, 4)
> -		/** @consumer.engine_class_instance: engine class and instance */
> -		u8 engine_class_instance;
> +#define XE_PAGEFAULT_LEVEL_MASK			GENMASK(2, 0)
> +#define XE_PAGEFAULT_TYPE_MASK			GENMASK(6, 3)
> +#define XE_PAGEFAULT_REQUEUE_MASK		BIT(7)
> +				/** @consumer.engine_class_instance: engine class and instance */
> +				u8 engine_class_instance;
>   #define XE_PAGEFAULT_ENGINE_CLASS_MASK		GENMASK(3, 0)
>   #define XE_PAGEFAULT_ENGINE_INSTANCE_MASK	GENMASK(7, 4)
> -		/** @pad: alignment padding */
> -		u8 pad;
> -		/** consumer.reserved: reserved bits for future expansion */
> -		u64 reserved;
> +				/** @consumer.asid: address space ID */
> +				u32 asid;
> +			};
> +			/**
> +			 * @consumer.end_addr: end address of page fault,
> +			 * populated by consumer after fault completion
> +			 */
> +			u64 end_addr;
> +		};
> +		/**
> +		 * @consumer.next: next pagefault chained to this fault,
> +		 * protected by pf_queue lock
> +		 */
> +		struct xe_pagefault *next;
>   	} consumer;
>   	/**
>   	 * @producer: State for the producer (i.e., HW/FW interface). Populated
> @@ -131,7 +150,7 @@ struct xe_pagefault_queue {
>   	u32 head;
>   	/** @tail: Tail pointer in bytes, moved by consumer, protected by @lock */
>   	u32 tail;
> -	/** @lock: protects page fault queue */
> +	/** @lock: protects page fault queue, workers caches */
>   	spinlock_t lock;
>   };
>   
> @@ -146,6 +165,21 @@ struct xe_pagefault_work {
>   	struct xe_device *xe;
>   	/** @id: Identifier for this work item */
>   	int id;
> +	/**
> +	 * @cache: Page fault cache for the currently processed fault
> +	 *
> +	 * Protected by the page fault queue lock.
> +	 */
> +	struct {
> +		/** @cache.start: Start address of the current page fault */
> +		u64 start;
> +		/** @cache.end: End address of the current page fault */
> +		u64 end;
> +		/** @cache.asid: Address space ID of the current page fault */
> +		u32 asid;
> +		/** @cache.pf: Pointer to the current page fault */
> +		struct xe_pagefault *pf;
> +	} cache;
>   	/** @work: Work item used to process the page fault */
>   	struct work_struct work;
>   };
> diff --git a/drivers/gpu/drm/xe/xe_svm.c b/drivers/gpu/drm/xe/xe_svm.c
> index 66eee490a0c3..fc439fd85187 100644
> --- a/drivers/gpu/drm/xe/xe_svm.c
> +++ b/drivers/gpu/drm/xe/xe_svm.c
> @@ -15,6 +15,7 @@
>   #include "xe_gt_stats.h"
>   #include "xe_migrate.h"
>   #include "xe_module.h"
> +#include "xe_pagefault.h"
>   #include "xe_pm.h"
>   #include "xe_pt.h"
>   #include "xe_svm.h"
> @@ -1215,8 +1216,8 @@ DECL_SVM_RANGE_US_STATS(bind, BIND)
>   DECL_SVM_RANGE_US_STATS(fault, PAGEFAULT)
>   
>   static int __xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> -				     struct xe_gt *gt, u64 fault_addr,
> -				     bool need_vram)
> +				     struct xe_pagefault *pf, struct xe_gt *gt,
> +				     u64 fault_addr, bool need_vram)
>   {
>   	int devmem_possible = IS_DGFX(vm->xe) &&
>   		IS_ENABLED(CONFIG_DRM_XE_PAGEMAP);
> @@ -1372,6 +1373,10 @@ static int __xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>   	xe_svm_range_bind_us_stats_incr(gt, range, bind_start);
>   
>   out:
> +	/* Give hint to immediately ack faults */
> +	xe_pagefault_set_start_addr(pf,  xe_svm_range_start(range));
> +	xe_pagefault_set_end_addr(pf, xe_svm_range_end(range));
> +
>   	xe_svm_range_fault_us_stats_incr(gt, range, start);
>   	mutex_unlock(&range->lock);
>   	drm_gpusvm_range_put(&range->base);
> @@ -1394,6 +1399,7 @@ static int __xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>    * xe_svm_handle_pagefault() - SVM handle page fault
>    * @vm: The VM.
>    * @vma: The CPU address mirror VMA.
> + * @pf: Pagefault structure
>    * @gt: The gt upon the fault occurred.
>    * @fault_addr: The GPU fault address.
>    * @atomic: The fault atomic access bit.
> @@ -1404,8 +1410,8 @@ static int __xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>    * Return: 0 on success, negative error code on error.
>    */
>   int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> -			    struct xe_gt *gt, u64 fault_addr,
> -			    bool atomic)
> +			    struct xe_pagefault *pf, struct xe_gt *gt,
> +			    u64 fault_addr, bool atomic)
>   {
>   	int need_vram, ret;
>   retry:
> @@ -1413,7 +1419,7 @@ int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
>   	if (need_vram < 0)
>   		return need_vram;
>   
> -	ret =  __xe_svm_handle_pagefault(vm, vma, gt, fault_addr,
> +	ret =  __xe_svm_handle_pagefault(vm, vma, pf, gt, fault_addr,
>   					 need_vram ? true : false);
>   	if (ret == -EAGAIN) {
>   		/*
> diff --git a/drivers/gpu/drm/xe/xe_svm.h b/drivers/gpu/drm/xe/xe_svm.h
> index ebcca34f7f4d..07be92579971 100644
> --- a/drivers/gpu/drm/xe/xe_svm.h
> +++ b/drivers/gpu/drm/xe/xe_svm.h
> @@ -21,6 +21,7 @@ struct drm_file;
>   struct xe_bo;
>   struct xe_gt;
>   struct xe_device;
> +struct xe_pagefault;
>   struct xe_vram_region;
>   struct xe_tile;
>   struct xe_vm;
> @@ -107,8 +108,8 @@ void xe_svm_fini(struct xe_vm *vm);
>   void xe_svm_close(struct xe_vm *vm);
>   
>   int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> -			    struct xe_gt *gt, u64 fault_addr,
> -			    bool atomic);
> +			    struct xe_pagefault *pf, struct xe_gt *gt,
> +			    u64 fault_addr, bool atomic);
>   
>   bool xe_svm_has_mapping(struct xe_vm *vm, u64 start, u64 end);
>   
> @@ -296,8 +297,8 @@ void xe_svm_close(struct xe_vm *vm)
>   
>   static inline
>   int xe_svm_handle_pagefault(struct xe_vm *vm, struct xe_vma *vma,
> -			    struct xe_gt *gt, u64 fault_addr,
> -			    bool atomic)
> +			    struct xe_pagefault *pf, struct xe_gt *gt,
> +			    u64 fault_addr, bool atomic)
>   {
>   	return 0;
>   }

This is huge and impressive work, both this patch and the whole series.

I can't say I'm done with the review yet; I will continue it, as I need to build an understanding of this series for

Intel Xe GPU Debug Support (eudebug) 
(https://patchwork.freedesktop.org/series/165777/)


Regards,

Maciej



Thread overview: 33+ messages
2026-02-26  4:28 [PATCH v4 00/12] Fine grained fault locking, threaded prefetch, storm cache Matthew Brost
2026-02-26  4:28 ` [PATCH v4 01/12] drm/xe: Fine grained page fault locking Matthew Brost
2026-02-26  4:28 ` [PATCH v4 02/12] drm/xe: Allow prefetch-only VM bind IOCTLs to use VM read lock Matthew Brost
2026-02-26  4:28 ` [PATCH v4 03/12] drm/xe: Thread prefetch of SVM ranges Matthew Brost
2026-02-26  4:28 ` [PATCH v4 04/12] drm/xe: Use a single page-fault queue with multiple workers Matthew Brost
2026-05-06 15:46   ` Maciej Patelczyk
2026-05-06 19:42     ` Matthew Brost
2026-05-07 12:41       ` Maciej Patelczyk
2026-02-26  4:28 ` [PATCH v4 05/12] drm/xe: Add num_pf_work modparam Matthew Brost
2026-05-06 15:59   ` Maciej Patelczyk
2026-02-26  4:28 ` [PATCH v4 06/12] drm/xe: Engine class and instance into a u8 Matthew Brost
2026-05-06 16:04   ` Maciej Patelczyk
2026-05-07 16:20     ` Maciej Patelczyk
2026-02-26  4:28 ` [PATCH v4 07/12] drm/xe: Track pagefault worker runtime Matthew Brost
2026-05-07 12:51   ` Maciej Patelczyk
2026-02-26  4:28 ` [PATCH v4 08/12] drm/xe: Chain page faults via queue-resident cache to avoid fault storms Matthew Brost
2026-05-08 12:03   ` Maciej Patelczyk [this message]
2026-02-26  4:28 ` [PATCH v4 09/12] drm/xe: Add pagefault chaining stats Matthew Brost
2026-05-07 13:15   ` Maciej Patelczyk
2026-05-07 13:52     ` Francois Dugast
2026-02-26  4:28 ` [PATCH v4 10/12] drm/xe: Add debugfs pagefault_info Matthew Brost
2026-05-07 10:07   ` Maciej Patelczyk
2026-02-26  4:28 ` [PATCH v4 11/12] drm/xe: batch CT pagefault acks with periodic flush Matthew Brost
2026-05-08  9:24   ` Maciej Patelczyk
2026-02-26  4:28 ` [PATCH v4 12/12] drm/xe: Track parallel page fault activity in GT stats Matthew Brost
2026-05-07 13:56   ` Maciej Patelczyk
2026-05-07 14:23     ` Francois Dugast
2026-02-26  4:35 ` ✗ CI.checkpatch: warning for Fine grained fault locking, threaded prefetch, storm cache (rev4) Patchwork
2026-02-26  4:36 ` ✓ CI.KUnit: success " Patchwork
2026-02-26  5:26 ` ✗ Xe.CI.BAT: failure " Patchwork
2026-02-26  8:59 ` ✗ Xe.CI.FULL: " Patchwork
2026-02-26 13:43 ` [PATCH v4 00/12] Fine grained fault locking, threaded prefetch, storm cache Thomas Hellström
2026-02-26 19:36   ` Matthew Brost
