From: Michal Wajdeczko <michal.wajdeczko@intel.com>
To: "Anirban, Sk" <sk.anirban@intel.com>,
"Summers, Stuart" <stuart.summers@intel.com>,
"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
Matthew Brost <matthew.brost@intel.com>
Cc: "Jadav, Raag" <raag.jadav@intel.com>,
"Belgaumkar, Vinay" <vinay.belgaumkar@intel.com>,
"Koujalagi, Mallesh" <mallesh.koujalagi@intel.com>,
"Purkait, Soham" <soham.purkait@intel.com>,
"Tauro, Riana" <riana.tauro@intel.com>,
"Nilawar, Badal" <badal.nilawar@intel.com>,
"Poosa, Karthik" <karthik.poosa@intel.com>,
"Gupta, Anshuman" <anshuman.gupta@intel.com>
Subject: Re: [PATCH] drm/xe/guc: suppress GuC error logs when device is wedged
Date: Tue, 21 Apr 2026 19:23:51 +0200
Message-ID: <71fb777f-1cd7-42ed-95df-08b6fd75f947@intel.com>
In-Reply-To: <9bc30d5a-06ac-45db-a797-68f2b9c96722@intel.com>
+ Matt
On 4/21/2026 5:44 PM, Anirban, Sk wrote:
> Hi,
>
> On 21-04-2026 01:21 am, Summers, Stuart wrote:
>> On Mon, 2026-04-20 at 16:59 +0530, Sk Anirban wrote:
>>> When the device is wedged, GuC CT sends return -ECANCELED. This is
not 100% true:
GuC CT returns -ECANCELED when CT is stopped
GuC CT is also stopped during GT reset
when the device is wedged, CT is stopped too
>>> expected behavior, not an actionable error. Avoid logging these as
>>> errors in the engine activity and power profile code paths.
>>>
>>> Signed-off-by: Sk Anirban <sk.anirban@intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_guc_engine_activity.c | 8 +++++---
>>> drivers/gpu/drm/xe/xe_guc_engine_activity.h | 2 +-
>>> drivers/gpu/drm/xe/xe_guc_pc.c | 4 ++--
>>> drivers/gpu/drm/xe/xe_uc.c | 5 ++++-
>>> 4 files changed, 12 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_engine_activity.c
>>> b/drivers/gpu/drm/xe/xe_guc_engine_activity.c
>>> index 2b99c1ebdd58..700f3464fb63 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_engine_activity.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_engine_activity.c
>>> @@ -464,18 +464,20 @@ int
>>> xe_guc_engine_activity_function_stats(struct xe_guc *guc, int
>>> num_vfs, bool
>>> *
>>> * Enable engine activity stats and set initial timestamps
>>> */
>>> -void xe_guc_engine_activity_enable_stats(struct xe_guc *guc)
>>> +int xe_guc_engine_activity_enable_stats(struct xe_guc *guc)
>>> {
>>> int ret;
>>> if (!xe_guc_engine_activity_supported(guc))
>>> - return;
>>> + return 0;
>>> ret = enable_engine_activity_stats(guc);
>>> - if (ret)
>>> + if (ret && !(xe_device_wedged(guc_to_xe(guc)) && ret == -
>>> ECANCELED))
>> Is there a reason we don't handle all of the cases described in
>> __guc_ct_send_locked()? It looks like before we do the ct->state ==
>> STOPPED check (which is where we'd return -ECANCELED), we also check if
>> the CT is broken (i.e. we got some bad return value from GuC and marked
>> CT as "dead", hence returning -EPIPE here) or disabled (and return -
>> ENODEV).
>>
>> Same question for the other cases you have below.
>>
>> Thanks,
>> Stuart
> -ECANCELED is the specific error returned when the CT is stopped & device is wedged.
to be clear: -ECANCELED was introduced only to indicate that H2Gs
are lost due to the CT being stopped, usually as part of a GT reset
(see commits dc75d03716fe and 94de94d24ea8),
and it doesn't mean that the device was wedged - hence the extra checks...
> Other errors may indicate different fault conditions and imo should be useful to log those.
>
> This follows the same pattern as pc_action_reset.
but that pattern doesn't look great either
maybe we should add that wedged check at the CT layer and then use
a different error code, like -ENOTRECOVERABLE, to avoid duplicating
the same condition in all callers?
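A rough standalone sketch of that idea follows. It is a userspace model,
not actual xe code: struct ct_model, ct_send_check() and
ct_error_worth_logging() are made-up names, and checking the wedged
state first is an assumption. The -EPIPE / -ENODEV / -ECANCELED mapping
follows what Stuart described above for __guc_ct_send_locked().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the CT state; these are not the xe types. */
enum ct_state { CT_DISABLED, CT_STOPPED, CT_ENABLED };

struct ct_model {
	enum ct_state state;
	bool broken;	/* CT marked dead after a bad reply from GuC */
	bool wedged;	/* device has been declared wedged */
};

/*
 * Model of the checks done before sending an H2G. The wedged case gets
 * its own error code, so it can be told apart from a CT that was merely
 * stopped, e.g. as part of a GT reset.
 */
static int ct_send_check(const struct ct_model *ct)
{
	if (ct->wedged)
		return -ENOTRECOVERABLE;	/* assumed new code for wedged */
	if (ct->broken)
		return -EPIPE;
	if (ct->state == CT_DISABLED)
		return -ENODEV;
	if (ct->state == CT_STOPPED)
		return -ECANCELED;
	return 0;
}

/*
 * Callers decide from the error code alone; they no longer need to
 * combine a wedged check with ret == -ECANCELED themselves.
 */
static bool ct_error_worth_logging(int ret)
{
	return ret && ret != -ENOTRECOVERABLE;
}
```

With this split, a send that fails with -ECANCELED during a GT reset is
still logged, while a send attempted after the device is wedged is not,
and each call site only has to compare one error code.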
OTOH, if we expect that there is no point in reporting errors
after we declare WEDGED state, maybe the same rule should apply
to the errors after GT reset? so we can just look for -ECANCELED?
btw, IMO we should rather focus on avoiding going to the wedged state
than trying to silence any follow-up error messages (which to some
extent prove that the driver either correctly noticed the fault, or
that we failed to perform some explicit cleanups and the driver still
continues to do something it shouldn't be doing after being wedged)
>
> Thanks,
>
> Anirban
>
>>
>>> xe_gt_err(guc_to_gt(guc), "failed to enable activity
>>> stats%d\n", ret);
>>> else
>>> engine_activity_set_cpu_ts(guc, 0);
>>> +
>>> + return ret;
>>> }
>>> static void engine_activity_fini(void *arg)
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_engine_activity.h
>>> b/drivers/gpu/drm/xe/xe_guc_engine_activity.h
>>> index b32926c2d208..188f325a462d 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_engine_activity.h
>>> +++ b/drivers/gpu/drm/xe/xe_guc_engine_activity.h
>>> @@ -13,7 +13,7 @@ struct xe_guc;
>>> int xe_guc_engine_activity_init(struct xe_guc *guc);
>>> bool xe_guc_engine_activity_supported(struct xe_guc *guc);
>>> -void xe_guc_engine_activity_enable_stats(struct xe_guc *guc);
>>> +int xe_guc_engine_activity_enable_stats(struct xe_guc *guc);
>>> int xe_guc_engine_activity_function_stats(struct xe_guc *guc, int
>>> num_vfs, bool enable);
>>> u64 xe_guc_engine_activity_active_ticks(struct xe_guc *guc, struct
>>> xe_hw_engine *hwe,
>>> unsigned int fn_id);
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_pc.c
>>> b/drivers/gpu/drm/xe/xe_guc_pc.c
>>> index 7ecd91ad6192..efcd432ef6ef 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_pc.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_pc.c
>>> @@ -1202,7 +1202,7 @@ int xe_guc_pc_set_power_profile(struct
>>> xe_guc_pc *pc, const char *buf)
>>> ret = pc_action_set_param(pc,
>>> SLPC_PARAM_POWER_PROFILE,
>>> val);
>>> - if (ret)
>>> + if (ret && !(xe_device_wedged(pc_to_xe(pc)) && ret == -
>>> ECANCELED))
>>> xe_gt_err_once(pc_to_gt(pc), "Failed to set power
>>> profile to %d: %pe\n",
>>> val, ERR_PTR(ret));
>>> else
>>> @@ -1306,7 +1306,7 @@ int xe_guc_pc_start(struct xe_guc_pc *pc)
>>> /* Set cached value of power_profile */
>>> ret = xe_guc_pc_set_power_profile(pc,
>>> power_profile_to_string(pc));
>>> - if (unlikely(ret))
>>> + if (ret && !(xe_device_wedged(xe) && ret == -ECANCELED))
>>> xe_gt_err(gt, "Failed to set SLPC power profile:
>>> %pe\n", ERR_PTR(ret));
>>> return ret;
>>> diff --git a/drivers/gpu/drm/xe/xe_uc.c b/drivers/gpu/drm/xe/xe_uc.c
>>> index 75091bde0d50..b440cf8c431d 100644
>>> --- a/drivers/gpu/drm/xe/xe_uc.c
>>> +++ b/drivers/gpu/drm/xe/xe_uc.c
>>> @@ -215,7 +215,10 @@ int xe_uc_load_hw(struct xe_uc *uc)
>>> if (ret)
>>> return ret;
>>> - xe_guc_engine_activity_enable_stats(&uc->guc);
>>> + ret = xe_guc_engine_activity_enable_stats(&uc->guc);
>>> +
>>> + if (xe_device_wedged(guc_to_xe(&uc->guc)) && ret == -
>>> ECANCELED)
>>> + return ret;
>>> /* We don't fail the driver load if HuC fails to auth */
>>> ret = xe_huc_auth(&uc->huc, XE_HUC_AUTH_VIA_GUC);