Intel-XE Archive on lore.kernel.org
From: "Gupta, Varun" <varun.gupta@intel.com>
To: "Summers, Stuart" <stuart.summers@intel.com>,
	"Brost, Matthew" <matthew.brost@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
	"Roper, Matthew D" <matthew.d.roper@intel.com>,
	"Dandamudi, Priyanka" <priyanka.dandamudi@intel.com>
Subject: Re: [PATCH v3 2/2] drm/xe: Add prefetch fault support for Xe3p
Date: Thu, 19 Feb 2026 15:16:02 +0530	[thread overview]
Message-ID: <c26fab51-24d6-4478-9cd7-65886914b2d5@intel.com> (raw)
In-Reply-To: <bbbe5c479327e25a2f25c5d4e4b24189f97cdce8.camel@intel.com>


On 03-Feb-26 4:04 AM, Summers, Stuart wrote:
> On Mon, 2026-02-02 at 14:24 -0800, Matthew Brost wrote:
>> On Mon, Feb 02, 2026 at 01:50:01PM -0700, Summers, Stuart wrote:
>>> On Mon, 2026-02-02 at 10:55 +0530, Varun Gupta wrote:
>>>> Xe3p hardware prefetches memory ranges and notifies software via
>>>> an
>>>> additional bit (bit 11) in the page fault descriptor that the
>>>> fault
>>>> was caused by prefetch.
>>>>
>>>> Extract the prefetch bit from the fault descriptor. When page
>>>> fault
>>>> handling fails, echo the prefetch bit in the response (bit 6) to
>>>> allow
>>>> the HW to suppress CAT errors for unsuccessful prefetch faults.
>>>> On
>>>> successful handling, clear the prefetch bit so it's not echoed.
>>>>
>>>> For failed prefetch faults, increment a stats counter and print a
>>>> single-line error message with the prefetch bit value to reduce
>>>> excessive logging.
>>>>
>>>> Based on original patches by Brian Welty <brian.welty@intel.com>
>>>> and Priyanka Dandamudi <priyanka.dandamudi@intel.com>.
>>>>
>>>> Bspec: 59311
>>>> Originally-by: Lucas De Marchi <lucas.demarchi@intel.com>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: Priyanka Dandamudi <priyanka.dandamudi@intel.com>
>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
>>>> Signed-off-by: Varun Gupta <varun.gupta@intel.com>
>>>>
>>>> ---
>>>> v3:
>>>>   - Drop the rename patch, keep xe_pagefault_print() unchanged
>>>> (Matt
>>>> Brost)
>>>>   - Move prefetch check to caller instead of inside print function
>>>> (Matt Brost)
>>>>   - Remove XE3P_ prefix from prefetch bit defines and add platform
>>>> comment (Matt Brost)
>>>>   - Show prefetch bit in error messages for debugging (Matt Brost)
>>>>   - Split stats counter into separate patch (Matt Brost)
>>>>
>>>> v2:
>>>>   - Changed comment wording from "repairs" to "handling" for
>>>> clarity
>>>> (Matt Roper)
>>>> ---
>>>>   drivers/gpu/drm/xe/xe_guc_fwif.h        |  5 +++--
>>>>   drivers/gpu/drm/xe/xe_guc_pagefault.c   |  2 ++
>>>>   drivers/gpu/drm/xe/xe_pagefault.c       | 16 +++++++++++++---
>>>>   drivers/gpu/drm/xe/xe_pagefault_types.h |  8 +++++++-
>>>>   4 files changed, 25 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_fwif.h
>>>> b/drivers/gpu/drm/xe/xe_guc_fwif.h
>>>> index a33ea288b907..1a8674daa26e 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_fwif.h
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_fwif.h
>>>> @@ -261,7 +261,8 @@ struct xe_guc_pagefault_desc {
>>>>   #define PFD_ACCESS_TYPE                GENMASK(1, 0)
>>>>   #define PFD_FAULT_TYPE         GENMASK(3, 2)
>>>>   #define PFD_VFID               GENMASK(9, 4)
>>>> -#define PFD_RSVD_1             GENMASK(11, 10)
>>>> +#define PFD_RSVD_1             BIT(10)
>>>> +#define PFD_PREFETCH           BIT(11) /* Only valid on Xe3+,
>>>> reserved on prior platforms */
>>>>   #define PFD_VIRTUAL_ADDR_LO    GENMASK(31, 12)
>>>>   #define PFD_VIRTUAL_ADDR_LO_SHIFT 12
>>>>   
>>>> @@ -281,7 +282,7 @@ struct xe_guc_pagefault_reply {
>>>>   
>>>>          u32 dw1;
>>>>   #define PFR_VFID               GENMASK(5, 0)
>>>> -#define PFR_RSVD_1             BIT(6)
>>>> +#define PFR_PREFETCH           BIT(6)  /* Only valid on Xe3+,
>>>> reserved on prior platforms */
>>>>   #define PFR_ENG_INSTANCE       GENMASK(12, 7)
>>>>   #define PFR_ENG_CLASS          GENMASK(15, 13)
>>>>   #define PFR_PDATA              GENMASK(31, 16)
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_pagefault.c
>>>> b/drivers/gpu/drm/xe/xe_guc_pagefault.c
>>>> index 719a18187a31..ca7f769848a9 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_pagefault.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_pagefault.c
>>>> @@ -27,6 +27,7 @@ static void guc_ack_fault(struct xe_pagefault
>>>> *pf,
>>>> int err)
>>>>                  FIELD_PREP(PFR_ASID, pf->consumer.asid),
>>>>   
>>>>                  FIELD_PREP(PFR_VFID, vfid) |
>>>> +               FIELD_PREP(PFR_PREFETCH, pf->consumer.prefetch) |
>>>>                  FIELD_PREP(PFR_ENG_INSTANCE, engine_instance) |
>>>>                  FIELD_PREP(PFR_ENG_CLASS, engine_class) |
>>>>                  FIELD_PREP(PFR_PDATA, pdata),
>>>> @@ -77,6 +78,7 @@ int xe_guc_pagefault_handler(struct xe_guc
>>>> *guc,
>>>> u32 *msg, u32 len)
>>>>          pf.consumer.asid = FIELD_GET(PFD_ASID, msg[1]);
>>>>          pf.consumer.access_type = FIELD_GET(PFD_ACCESS_TYPE,
>>>> msg[2]);
>>>>          pf.consumer.fault_type = FIELD_GET(PFD_FAULT_TYPE,
>>>> msg[2]);
>>>> +       pf.consumer.prefetch = FIELD_GET(PFD_PREFETCH, msg[2]);
>>>>          if (FIELD_GET(XE2_PFD_TRVA_FAULT, msg[0]))
>>>>                  pf.consumer.fault_level =
>>>> XE_PAGEFAULT_LEVEL_NACK;
>>>>          else
>>>> diff --git a/drivers/gpu/drm/xe/xe_pagefault.c
>>>> b/drivers/gpu/drm/xe/xe_pagefault.c
>>>> index 6bee53d6ffc3..733d4ad28914 100644
>>>> --- a/drivers/gpu/drm/xe/xe_pagefault.c
>>>> +++ b/drivers/gpu/drm/xe/xe_pagefault.c
>>>> @@ -259,9 +259,19 @@ static void xe_pagefault_queue_work(struct
>>>> work_struct *w)
>>>>   
>>>>                  err = xe_pagefault_service(&pf);
>>>>                  if (err) {
>>>> -                       xe_pagefault_print(&pf);
>>>> -                       xe_gt_info(pf.gt, "Fault response:
>>>> Unsuccessful %pe\n",
>>>> -                                  ERR_PTR(err));
>>>> +                       if (!pf.consumer.prefetch) {
>>>> +                               xe_pagefault_print(&pf);
>>>> +                       } else {
>>>> +                               xe_gt_stats_incr(pf.gt,
>>>> XE_GT_STATS_ID_INVALID_PREFETCH_PAGEFAULT_COUNT, 1);
>>>> +                       }
>> You don't need {} in the if / else statement.
>>
>>>> +                       xe_gt_info(pf.gt, "Fault response:
>>>> Unsuccessful %pe, prefetch=%d\n",
>>>> +                                  ERR_PTR(err),
>>>> pf.consumer.prefetch);
>>> Does it make sense to rate limit this message in case the test
>>> sends
>>> this over and over? I guess this wouldn't be much different from
>>> the
>>> normal case though so not required in this patch.
>>>
>> We only have xe_gt_err_ratelimited, so I'd say this is probably fine
>> as is.
>>
>> Or maybe if prefetch is set we downgrade the message to dbg level?
>> This
>> should avoid spam in typical production settings.
> I like the idea of moving this to a debug print, but I also don't know
> that this needs to block the review since it was already an info
> before..
Noted. Will change this to dbg.
>>>> +               } else {
>>>> +                       /*
>>>> +                        * Clear prefetch bit - only needed to
>>>> suppress CAT errors
>>>> +                        * on unsuccessful handling.
>>> So bspec indicates this response bit is used to indicate either a
>>> prefetch memory access response or to suppress fault related cat
>>> errors. So shouldn't we be leaving this as-is here?
>>>
>> I would agree we probably shouldn't be touching this bit here. I don't
>> have a test platform, nor is one in CI yet, to verify that it is safe to
>> leave it untouched though - bspec can be wrong.
> I don't know of any specific side effect to a prefetch response here vs
> just leaving it as a successful, non-prefetch response which I think
> hardware probably drops also. I just think for safety reasons (what if
> that hardware handling changes in the future) it's best to follow bspec
> and respond how we received it. But yeah definitely agree that should
> be tested before merging.

Suppressing CAT errors on unsuccessful prefetch faults is the intended
behavior: the hardware prefetch mechanism is designed to tolerate failures
silently rather than escalating to an engine reset.

Clearing the bit on the success path is purely defensive. The fault is
resolved and there is nothing to suppress, so we zero it out to avoid
sending a bit with no defined meaning in that context.
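
To make the echo/clear behavior concrete, here is a minimal standalone
sketch, not the driver code. The bit positions (11 in the fault
descriptor, 6 in the reply) come from the patch; the helper names
pfd_get_prefetch() and pfr_set_prefetch() are hypothetical and exist only
for illustration. The patch itself uses FIELD_GET/FIELD_PREP with
PFD_PREFETCH and PFR_PREFETCH.

```c
#include <stdint.h>

/* Bit positions as defined in the patch (PFD_PREFETCH / PFR_PREFETCH). */
#define PFD_PREFETCH_BIT 11u /* fault descriptor dword */
#define PFR_PREFETCH_BIT 6u  /* reply dw1 */

/* Hypothetical helper: read the prefetch flag from the descriptor dword. */
static inline uint8_t pfd_get_prefetch(uint32_t desc_dw)
{
	return (uint8_t)((desc_dw >> PFD_PREFETCH_BIT) & 1u);
}

/*
 * Hypothetical helper: compute the prefetch bit of the reply dword.
 *   err != 0: echo the bit so the HW can suppress the CAT error.
 *   err == 0: clear the bit, since there is nothing to suppress.
 * All other bits of reply_dw1 are left unchanged.
 */
static inline uint32_t pfr_set_prefetch(uint32_t reply_dw1, uint8_t prefetch,
					int err)
{
	const uint32_t mask = 1u << PFR_PREFETCH_BIT;

	reply_dw1 &= ~mask;
	if (err && prefetch)
		reply_dw1 |= mask;

	return reply_dw1;
}
```

With these helpers, a descriptor dword of 0x800 reports prefetch = 1, a
failed fault yields a reply with bit 6 set, and a serviced fault yields a
reply with bit 6 clear.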

Thanks,
Varun

>>> And if we aren't clearing this in the if (err) part of the
>>> condition
>>> above, we won't escalate to a cat fault (since it is suppressed),
>>> is
>>> that what we want here? Or we're worried about a storm of cat
>>> faults
>> I believe clearing if (err) should actually depend on the VM's settings.
>>
>> If the VM has scratch - we should probably print the fault + trigger a
>> CAT error, as in this case prefetch shouldn't ever fail unless we have a
>> software bug in the KMD.
>>
>> If the VM doesn't have scratch - it is somewhat normal for a prefetch
>> fault to be unsuccessful. My understanding is that compute kernels
>> regularly issue prefetches to what may be invalid memory, as the
>> kernel compiler more or less blindly inserts these without knowing the
>> memory bounds. In this case, we don't want to kill the kernel. This is
>> part of the reason we added scratch support on faulting VMs, to avoid
>> prefetch fault storms to invalid memory, and IIRC we can turn off
>> prefetch faults (without the scratch W/A) in subsequent platforms.
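
The scratch-dependent handling suggested above could be sketched roughly
as follows. This is only an illustration of the proposed policy under the
assumptions stated in the thread, and it is not part of the v3 patch. The
enum, the function name and the vm_has_scratch parameter are all
hypothetical.

```c
#include <stdbool.h>

/* Hypothetical outcome for a fault that could not be serviced. */
enum pf_fail_action {
	PF_FAIL_REPORT,   /* print the fault, do not echo the prefetch bit */
	PF_FAIL_SUPPRESS, /* echo the prefetch bit so HW drops the CAT error */
};

/*
 * Hypothetical policy for a failed fault:
 *  - non-prefetch faults are always reported;
 *  - prefetch faults on a VM with scratch are reported too, since a
 *    failure there would point at a software bug in the KMD;
 *  - prefetch faults on a VM without scratch are suppressed, since
 *    out-of-bounds prefetches are expected in that configuration.
 */
static inline enum pf_fail_action pf_fail_policy(bool prefetch,
						 bool vm_has_scratch)
{
	if (!prefetch || vm_has_scratch)
		return PF_FAIL_REPORT;

	return PF_FAIL_SUPPRESS;
}
```

Only the (prefetch, no scratch) combination is suppressed here; whether
that matches the hardware behavior would still need to be verified on a
test platform.
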
> Yeah tying to scratch makes sense to me too. I get that the
> application/compute kernel might do something out of bounds
> "unexpectedly" (or as you mentioned because the compiler isn't being
> precise for whatever reason). But we should have a consistent
> implementation in the driver to handle that case rather than handling
> separately here and for the scratch on/off case. Eventually my
> understanding is we want to be able to drop scratch support once we can
> get the applications/compilers to guarantee in-bounds accesses - I
> realize this might not happen any time soon though...
>
> Thanks,
> Stuart
>
>> Matt
>>
>>> and engine resets? But as I mentioned above, I don't see why this
>>> case
>>> would really be different from a normal cat fault in terms of
>>> frequency
>>> from a buggy application.
>>>
>>> Thanks,
>>> Stuart
>>>
>>>> +                        */
>>>> +                       pf.consumer.prefetch = 0;
>>>>                  }
>>>>   
>>>>                  pf.producer.ops->ack_fault(&pf, err);
>>>> diff --git a/drivers/gpu/drm/xe/xe_pagefault_types.h
>>>> b/drivers/gpu/drm/xe/xe_pagefault_types.h
>>>> index d3b516407d60..9e38d6e2dac5 100644
>>>> --- a/drivers/gpu/drm/xe/xe_pagefault_types.h
>>>> +++ b/drivers/gpu/drm/xe/xe_pagefault_types.h
>>>> @@ -84,8 +84,14 @@ struct xe_pagefault {
>>>>                  u8 engine_class;
>>>>                  /** @consumer.engine_instance: engine instance */
>>>>                  u8 engine_instance;
>>>> +               /**
>>>> +                * @consumer.prefetch: fault is caused by HW
>>>> prefetch.
>>>> +                * Echo in response to suppress CAT errors on
>>>> +                * unsuccessful handling.
>>>> +                */
>>>> +               u8 prefetch;
>>>>                  /** consumer.reserved: reserved bits for future
>>>> expansion */
>>>> -               u8 reserved[7];
>>>> +               u8 reserved[6];
>>>>          } consumer;
>>>>          /**
>>>>           * @producer: State for the producer (i.e., HW/FW
>>>> interface).
>>>> Populated


Thread overview: 12+ messages
2026-02-02  5:25 [PATCH v3 0/2] drm/xe: Add prefetch pagefault support for Xe3p Varun Gupta
2026-02-02  5:25 ` [PATCH v3 1/2] drm/xe: Add counter for invalid prefetch pagefaults Varun Gupta
2026-02-02  5:25 ` [PATCH v3 2/2] drm/xe: Add prefetch fault support for Xe3p Varun Gupta
2026-02-02 20:50   ` Summers, Stuart
2026-02-02 22:24     ` Matthew Brost
2026-02-02 22:34       ` Summers, Stuart
2026-02-19  9:46         ` Gupta, Varun [this message]
2026-02-02  6:00 ` ✗ CI.checkpatch: warning for drm/xe: Add prefetch pagefault support for Xe3p (rev3) Patchwork
2026-02-02  6:01 ` ✓ CI.KUnit: success " Patchwork
2026-02-02  6:35 ` ✓ Xe.CI.BAT: " Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2026-02-02  5:14 [PATCH v3 0/2] drm/xe: Add prefetch fault support for Xe3p Varun Gupta
2026-02-02  5:14 ` [PATCH v3 2/2] " Varun Gupta
2026-01-28  9:11 [PATCH v3 0/2] " Varun Gupta
2026-01-28  9:11 ` [PATCH v3 2/2] drm/xe: " Varun Gupta
