Intel-XE Archive on lore.kernel.org
From: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
To: Matthew Brost <matthew.brost@intel.com>,
	Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <simona.vetter@ffwll.ch>,
	<christian.koenig@amd.com>, <thomas.hellstrom@linux.intel.com>,
	<joonas.lahtinen@linux.intel.com>,
	<christoph.manszewski@intel.com>, <rodrigo.vivi@intel.com>,
	<andrzej.hajda@intel.com>, <matthew.auld@intel.com>,
	<maciej.patelczyk@intel.com>
Subject: Re: [PATCH 22/22] drm/xe/eudebug: Enable EU pagefault handling
Date: Fri, 27 Feb 2026 14:11:53 -0800
Message-ID: <3d7dde05-c55b-4966-b87c-f72afebdda7d@intel.com>
In-Reply-To: <aZyfQaq2SkgMn42Q@lstrano-desk.jf.intel.com>



On 2/23/26 10:41 AM, Matthew Brost wrote:
> On Mon, Feb 23, 2026 at 04:03:17PM +0200, Mika Kuoppala wrote:
>> From: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
>>
>> The XE2 (and PVC) HW has a limitation that the pagefault due to invalid
>> access will halt the corresponding EUs. To solve this problem, enable
>> EU pagefault handling functionality, which allows to unhalt pagefaulted
>> eu threads and to EU debugger to get inform about the eu attentions state
>> of EU threads during execution.
>>
>> If a pagefault occurs, send the DRM_XE_EUDEBUG_EVENT_PAGEFAULT event
>> after handling the pagefault.
>>
>> The pagefault handling is a mechanism that allows a stalled EU thread to
>> enter SIP mode by installing a temporal null page to the page table entry
>> where the pagefault happened.
>>
>> A brief description of the page fault handling mechanism flow between KMD
>> and the eu thread is as follows
>>
>> (1) eu thread accesses unallocated address
>> (2) pagefault happens and eu thread stalls
>> (3) XE kmd set an force eu thread exception to allow the running eu thread
>>      to enter SIP mode (kmd set ForceException / ForceExternalHalt bit of
>>      TD_CTL register)
>>      Not stalled (none-pagefaulted) eu threads enter SIP mode
>> (4) XE kmd installs temporal null page to the pagetable entry of the
>>      address where pagefault happened.
>> (5) XE kmd replies pagefault successful message to GUC
>> (6) stalled eu thread resumes as per pagefault condition has resolved
>> (7) resumed eu thread enters SIP mode due to force exception set by (3)
>> (8) adapted to consumer/produced pagefaults
>>
>> As designed this feature to only work when eudbug is enabled, it should
>> have no impact to regular recoverable pagefault code path.
>>
>> v2: - pf->q holds the vm ref so drop it (Mika)
>>      - streamline uapi (Mika)
>>      - cleanup the pagefault through producer if (Mika)
>>
>> Signed-off-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_guc_pagefault.c   |  8 +++++++
>>   drivers/gpu/drm/xe/xe_pagefault.c       | 31 ++++++++++++++++++++++++-
>>   drivers/gpu/drm/xe/xe_pagefault_types.h |  9 +++++++
>>   3 files changed, 47 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.c b/drivers/gpu/drm/xe/xe_guc_pagefault.c
>> index d48f6ed103bb..6adf3bf73b1c 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_pagefault.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_pagefault.c
>> @@ -8,6 +8,7 @@
>>   #include "xe_guc_ct.h"
>>   #include "xe_guc_pagefault.h"
>>   #include "xe_pagefault.h"
>> +#include "xe_eudebug_pagefault.h"
>>   
>>   static void guc_ack_fault(struct xe_pagefault *pf, int err)
>>   {
>> @@ -37,8 +38,15 @@ static void guc_ack_fault(struct xe_pagefault *pf, int err)
>>   	xe_guc_ct_send(&guc->ct, action, ARRAY_SIZE(action), 0, 0);
>>   }
>>   
>> +static void guc_cleanup_fault(struct xe_pagefault *pf, int err)
>> +{
>> +	xe_eudebug_pagefault_service(pf);
>> +	xe_eudebug_pagefault_destroy(pf, 0);
>> +}
>> +
>>   static const struct xe_pagefault_ops guc_pagefault_ops = {
>>   	.ack_fault = guc_ack_fault,
>> +	.cleanup_fault = guc_cleanup_fault,
>>   };
>>   
>>   /**
>> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c b/drivers/gpu/drm/xe/xe_pagefault.c
>> index 72f589fd2b64..9dcd854e99f9 100644
>> --- a/drivers/gpu/drm/xe/xe_pagefault.c
>> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
>> @@ -10,6 +10,7 @@
>>   
>>   #include "xe_bo.h"
>>   #include "xe_device.h"
>> +#include "xe_eudebug_pagefault.h"
>>   #include "xe_gt_printk.h"
>>   #include "xe_gt_types.h"
>>   #include "xe_gt_stats.h"
>> @@ -171,6 +172,8 @@ static int xe_pagefault_service(struct xe_pagefault *pf)
>>   	if (IS_ERR(vm))
>>   		return PTR_ERR(vm);
>>   
>> +	xe_eudebug_pagefault_create(vm, pf);
>> +
>>   	/*
>>   	 * TODO: Change to read lock? Using write lock for simplicity.
>>   	 */
>> @@ -184,9 +187,28 @@ static int xe_pagefault_service(struct xe_pagefault *pf)
>>   	vma = xe_vm_find_vma_by_addr(vm, pf->consumer.page_addr);
> 
> I've mentioned this before - this fundamentally broken if SVM is
> enabled as the VMA lookup will never fail given VMA tree is completely
> populated in the SVM cases (i.e., when SVM is enabled the first thing
> the UMD does is bind the entire CPU address space with a CPU mirror
> VMA). What will fail in the SVM case is xe_svm_handle_pagefault will
> likely return -ENOENT. UMDs from my understanding will enable SVM by
> default so this likely needs to be rethought.
> 
Thank you for your reply. Yes, eudebug pagefault handling needs additional
implementation for the case where SVM is used.
As you point out, in the SVM + eudebug pagefault scenario
xe_svm_handle_pagefault() can return -ENOENT (i.e., when userspace has not
allocated memory at that address via mmap etc.). In that case eudebug
needs to install a temporary page at the address where the page fault
occurred, so that the stalled EU threads can enter SIP.
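
To make the required behaviour concrete, here is a minimal userspace
sketch of the fallback, not driver code. Every name in it (fault_ctx,
model_svm_handle_pagefault, model_service_fault) is a hypothetical
stand-in; only the -ENOENT condition and the temporary page install come
from the discussion above. How the install is actually done is exactly
the open question below.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model state; these are not real xe driver structures. */
struct fault_ctx {
	bool eudebug_enabled;      /* a debugger is attached to this VM */
	bool range_backed;         /* userspace has memory at the faulting address */
	bool temp_page_installed;  /* set once the fallback has been taken */
};

/*
 * Stand-in for xe_svm_handle_pagefault(): assumed to fail with -ENOENT
 * when nothing backs the faulting address.
 */
static int model_svm_handle_pagefault(const struct fault_ctx *ctx)
{
	return ctx->range_backed ? 0 : -ENOENT;
}

/*
 * Fallback policy: only when the SVM path reports -ENOENT and eudebug is
 * enabled, install a temporary page and report the fault as handled so
 * the stalled EU thread can resume and enter SIP.
 */
static int model_service_fault(struct fault_ctx *ctx)
{
	int err = model_svm_handle_pagefault(ctx);

	if (err != -ENOENT || !ctx->eudebug_enabled)
		return err;

	ctx->temp_page_installed = true;
	return 0;
}
```

Without eudebug the model keeps returning -ENOENT, matching the intent
that the regular recoverable pagefault path is not affected.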

Two methods come to mind for installing the temporary page in the page table:

1) Allocate temporary memory by emulating what do_mmap() / __mmap_region()
   do, so that the eudebug pagefault case looks like an ordinary memory
   allocation in the SVM scenario.

   - Pros: Follows the do_mmap() flow for memory allocation and only
     requires one additional call to xe_svm_handle_pagefault() after the
     temporary page is installed (simple to implement).
   - Cons: (1) If the installed temporary VMA is not removed, a CPU page
               fault at the same address triggers migration to system
               memory, so the CPU access does not raise the segmentation
               fault that a CPU debugger would expect.
           (2) The GPU debugger may fail to handle page faults in
               low-address regions that are inaccessible to userspace
               (mmap_min_addr issue).

2) In the SVM scenario, install the temporary page only in the GPU page
   table, at the address where the page fault occurred, while handling the
   eudebug pagefault. That is, update the page table directly without
   going through the xe_svm_handle_pagefault() flow.


Could I get your input on these two approaches? Or do you have another
approach in mind?

G.G.

> Matt
> 
>>   	if (!vma) {
>>   		err = -EINVAL;
>> -		goto unlock_vm;
>> +		vma = xe_eudebug_create_vma(vm, pf);
>> +		if (IS_ERR(vma)) {
>> +			err = PTR_ERR(vma);
>> +			vma = NULL;
>> +		}
>>   	}
>>   
>> +	if (vma) {
>> +		/*
>> +		 * When creating an instance of eudebug_pagefault, there was
>> +		 * no vma containing the ppgtt address where the pagefault occurred,
>> +		 * but when reacquiring vm->lock, there is.
>> +		 * During not aquiring the vm->lock from this context,
>> +		 * but vma corresponding to the address where the pagefault occurred
>> +		 * in another context has allocated.
>> +		 */
>> +		err = 0;
>> +	}
>> +
>> +	if (err)
>> +		goto unlock_vm;
>> +
>>   	atomic = xe_pagefault_access_is_atomic(pf->consumer.access_type);
>>   
>>   	if (xe_vma_is_cpu_addr_mirror(vma))
>> @@ -198,6 +220,10 @@ static int xe_pagefault_service(struct xe_pagefault *pf)
>>   unlock_vm:
>>   	if (!err)
>>   		vm->usm.last_fault_vma = vma;
>> +
>> +	if (err)
>> +		xe_eudebug_pagefault_destroy(pf, err);
>> +
>>   	up_write(&vm->lock);
>>   	xe_vm_put(vm);
>>   
>> @@ -268,6 +294,9 @@ static void xe_pagefault_queue_work(struct work_struct *w)
>>   
>>   		pf.producer.ops->ack_fault(&pf, err);
>>   
>> +		if (pf.producer.ops->cleanup_fault)
>> +			pf.producer.ops->cleanup_fault(&pf, err);
>> +
>>   		if (time_after(jiffies, threshold)) {
>>   			queue_work(gt_to_xe(pf.gt)->usm.pf_wq, w);
>>   			break;
>> diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h b/drivers/gpu/drm/xe/xe_pagefault_types.h
>> index 2bee858da597..9d2d29d35a4b 100644
>> --- a/drivers/gpu/drm/xe/xe_pagefault_types.h
>> +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
>> @@ -43,6 +43,15 @@ struct xe_pagefault_ops {
>>   	 * sends the result to the HW/FW interface.
>>   	 */
>>   	void (*ack_fault)(struct xe_pagefault *pf, int err);
>> +
>> +	/**
>> +	 * @cleanup_fault: Cleanup for producer, if any
>> +	 * @pf: Page fault
>> +	 * @err: Error state of fault
>> +	 *
>> +	 * Page fault producer received cleanup request from consumer
>> +	 */
>> +	void (*cleanup_fault)(struct xe_pagefault *pf, int err);
>>   };
>>   
>>   /**
>> -- 
>> 2.43.0
>>

