Re: [PATCH v3 4/4] drm/xe/guc: Track FAST_REQ H2Gs to report where errors came from

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: John Harrison <john.c.harrison@intel.com>
To: Michal Wajdeczko <michal.wajdeczko@intel.com>,
	<Intel-Xe@Lists.FreeDesktop.Org>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Subject: Re: [PATCH v3 4/4] drm/xe/guc: Track FAST_REQ H2Gs to report where errors came from
Date: Mon, 12 May 2025 12:52:48 -0700	[thread overview]
Message-ID: <5c7bacb4-d1ba-4980-92df-085c4f405762@intel.com> (raw)
In-Reply-To: <9c82c88b-9d17-4954-ae84-a89e73773038@intel.com>

On 5/8/2025 1:11 PM, Michal Wajdeczko wrote:
> On 08.05.2025 03:34, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> Most H2G messages are FAST_REQ which means no synchronous response is
>> expected. The messages are sent as fire-and-forget with no tracking.
>> However, errors can still be returned when something goes unexpectedly
>> wrong. That leads to confusion due to not being able to match up the
>> error response to the originating H2G.
>>
>> So add support for tracking the FAST_REQ H2Gs and matching up an error
>> response to its originator. This is only enabled in XE_DEBUG builds
>> given that such errors should never happen in a working system and
>> there is an overhead for the tracking.
>>
>> Further, if XE_DEBUG_GUC is enabled then even more memory and time is
>> used to record the call stack of each H2G and report that with the
>> error. That makes it much easier to work out where a specific H2G came
>> from if there are multiple code paths that can send it.
>>
>> v2: Some re-wording of comments and prints, more consistent use of #if
>> vs stub functions - review feedback from Daniele & Michal).
>> v3: Split config change to separate patch, improve a debug print
>> (review feedback from Michal).
>>
>> Original-i915-code: Michal Wajdeczko <michal.wajdeczko@intel.com>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
>
> with few nits below
>
>> ---
>>   drivers/gpu/drm/xe/Kconfig.debug     |   5 +-
>>   drivers/gpu/drm/xe/xe_guc_ct.c       | 116 ++++++++++++++++++++++-----
>>   drivers/gpu/drm/xe/xe_guc_ct_types.h |  15 ++++
>>   3 files changed, 116 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/Kconfig.debug b/drivers/gpu/drm/xe/Kconfig.debug
>> index db063a513b1e..01735c6ece8b 100644
>> --- a/drivers/gpu/drm/xe/Kconfig.debug
>> +++ b/drivers/gpu/drm/xe/Kconfig.debug
>> @@ -90,10 +90,13 @@ config DRM_XE_DEBUG_GUC
>>           bool "Enable extra GuC related debug options"
>>           depends on DRM_XE_DEBUG
>>           default n
>> +        select STACKDEPOT
>>           help
>>             Choose this option when debugging guc issues.
>>             The GuC log buffer is increased to the maximum allowed, which should
>> -          be large enough for complex issues.
>> +          be large enough for complex issues. The tracking of FAST_REQ messages
>> +          is extended to include a record of the calling stack, which is then
>> +          dumped on a FAST_REQ error notification.
>>   
>>             Recommended for driver developers only.
>>   
>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>> index 9213fdc25950..2d38aea9c0a2 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>> @@ -625,6 +625,47 @@ static void g2h_release_space(struct xe_guc_ct *ct, u32 g2h_len)
>>   	spin_unlock_irq(&ct->fast_lock);
>>   }
>>   
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>> +static void fast_req_track(struct xe_guc_ct *ct, u16 fence, u16 action)
>> +{
>> +	unsigned int slot = fence % ARRAY_SIZE(ct->fast_req);
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG_GUC)
>> +	unsigned long entries[SZ_32];
>> +	unsigned int n;
>> +
>> +	n = stack_trace_save(entries, ARRAY_SIZE(entries), 1);
>> +
>> +	/* May be called under spinlock, so avoid sleeping */
>> +	ct->fast_req[slot].stack = stack_depot_save(entries, n, GFP_NOWAIT);
>> +#endif
>> +	ct->fast_req[slot].fence = fence;
>> +	ct->fast_req[slot].action = action;
>> +}
>> +#else
>> +static void fast_req_track(struct xe_guc_ct *ct, u16 fence, u16 action)
>> +{
>> +}
>> +#endif
>> +
>> +/*
>> + * The CT protocol accepts a 16 bits fence. This field is fully owned by the
>> + * driver, the GuC will just copy it to the reply message. Since we need to
>> + * be able to distinguish between replies to REQUEST and FAST_REQUEST messages,
>> + * we use one bit of the seqno as an indicator for that and a rolling counter
>> + * for the remaining 15 bits.
>> + */
>> +#define CT_SEQNO_MASK GENMASK(14, 0)
>> +#define CT_SEQNO_UNTRACKED BIT(15)
>> +static u16 next_ct_seqno(struct xe_guc_ct *ct, bool is_g2h_fence)
>> +{
>> +	u32 seqno = ct->fence_seqno++ & CT_SEQNO_MASK;
>> +
>> +	if (!is_g2h_fence)
>> +		seqno |= CT_SEQNO_UNTRACKED;
>> +
>> +	return seqno;
>> +}
>> +
>>   #define H2G_CT_HEADERS (GUC_CTB_HDR_LEN + 1) /* one DW CTB header and one DW HxG header */
>>   
>>   static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
>> @@ -716,6 +757,10 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
>>   	xe_map_memcpy_to(xe, &map, H2G_CT_HEADERS * sizeof(u32), action, len * sizeof(u32));
>>   	xe_device_wmb(xe);
>>   
>> +	if (ct_fence_value & CT_SEQNO_UNTRACKED)
> shouldn't we use "want_response" instead?
>
> it will be then consistent with the code below which selects whether the
> request will be send as GUC_HXG_TYPE_REQUEST or FAST_REQUEST
You mean the code above? Yeah, I guess. The two are basically derived 
from the same source but it makes sense to use the locally cached 
version and be consistent with the earlier test.

>
>> +		fast_req_track(ct, ct_fence_value,
>> +			       FIELD_GET(GUC_HXG_EVENT_MSG_0_ACTION, action[0]));
>> +
>>   	/* Update local copies */
>>   	h2g->info.tail = (tail + full_len) % h2g->info.size;
>>   	h2g_reserve_space(ct, full_len);
>> @@ -733,25 +778,6 @@ static int h2g_write(struct xe_guc_ct *ct, const u32 *action, u32 len,
>>   	return -EPIPE;
>>   }
>>   
>> -/*
>> - * The CT protocol accepts a 16 bits fence. This field is fully owned by the
>> - * driver, the GuC will just copy it to the reply message. Since we need to
>> - * be able to distinguish between replies to REQUEST and FAST_REQUEST messages,
>> - * we use one bit of the seqno as an indicator for that and a rolling counter
>> - * for the remaining 15 bits.
>> - */
>> -#define CT_SEQNO_MASK GENMASK(14, 0)
>> -#define CT_SEQNO_UNTRACKED BIT(15)
>> -static u16 next_ct_seqno(struct xe_guc_ct *ct, bool is_g2h_fence)
>> -{
>> -	u32 seqno = ct->fence_seqno++ & CT_SEQNO_MASK;
>> -
>> -	if (!is_g2h_fence)
>> -		seqno |= CT_SEQNO_UNTRACKED;
>> -
>> -	return seqno;
>> -}
>> -
>>   static int __guc_ct_send_locked(struct xe_guc_ct *ct, const u32 *action,
>>   				u32 len, u32 g2h_len, u32 num_g2h,
>>   				struct g2h_fence *g2h_fence)
>> @@ -1143,6 +1169,55 @@ static int guc_crash_process_msg(struct xe_guc_ct *ct, u32 action)
>>   	return 0;
>>   }
>>   
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>> +static void fast_req_report(struct xe_guc_ct *ct, u16 fence)
>> +{
>> +	u16 fence_min = (u16)~0U, fence_max = 0;
> fence_min = U16_MAX
Doh!

>
>> +	struct xe_gt *gt = ct_to_gt(ct);
>> +	bool found = false;
>> +	unsigned int n;
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG_GUC)
>> +	char *buf;
>> +#endif
>> +
>> +	lockdep_assert_held(&ct->lock);
>> +
>> +	for (n = 0; n < ARRAY_SIZE(ct->fast_req); n++) {
>> +		if (ct->fast_req[n].fence < fence_min)
>> +			fence_min = ct->fast_req[n].fence;
>> +		if (ct->fast_req[n].fence > fence_max)
>> +			fence_max = ct->fast_req[n].fence;
>> +
>> +		if (ct->fast_req[n].fence != fence)
>> +			continue;
>> +		found = true;
>> +
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG_GUC)
>> +		buf = kmalloc(SZ_4K, GFP_NOWAIT);
>> +		if (buf && stack_depot_snprint(ct->fast_req[n].stack, buf, SZ_4K, 0))
>> +			xe_gt_err(gt, "Fence 0x%x was used by action %#04x sent at:\n%s",
>> +				  fence, ct->fast_req[n].action, buf);
>> +		else
>> +			xe_gt_err(gt, "Fence 0x%x was used by action %#04x [failed to retrieve stack]\n",
>> +				  fence, ct->fast_req[n].action);
>> +		kfree(buf);
>> +#else
>> +		xe_gt_err(gt, "Fence 0x%x was used by action %#04x\n",
>> +			  fence, ct->fast_req[n].action);
>> +#endif
>> +		break;
>> +	}
>> +
>> +	if (!found)
>> +		xe_gt_warn(gt, "Fence 0x%x not found - tracking buffer wrapped? [range = 0x%x -> 0x%x]\n",
>> +			   fence, fence_min, fence_max);
> maybe we should also print current ct->fence_seqno to rule out
> completely broken received fence?
Not sure I follow. The only completely broken value would be one that is 
 >16 bits. Including the next value to be sent will give you an idea of 
how far backed up the queue is. So yeah, I can certainly add it in. But 
it won't tell you whether something is broken or not.


>
>> +}
>> +#else
>> +static void fast_req_report(struct xe_guc_ct *ct, u16 fence)
>> +{
>> +}
>> +#endif
>> +
>>   static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
>>   {
>>   	struct xe_gt *gt =  ct_to_gt(ct);
>> @@ -1171,6 +1246,9 @@ static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
>>   		else
>>   			xe_gt_err(gt, "unexpected response %u for FAST_REQ H2G fence 0x%x!\n",
>>   				  type, fence);
>> +
>> +		fast_req_report(ct, fence);
>> +
>>   		CT_DEAD(ct, NULL, PARSE_G2H_RESPONSE);
>>   
>>   		return -EPROTO;
>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct_types.h b/drivers/gpu/drm/xe/xe_guc_ct_types.h
>> index 8e1b9d981d61..f58cea36c3c5 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_ct_types.h
>> +++ b/drivers/gpu/drm/xe/xe_guc_ct_types.h
>> @@ -9,6 +9,7 @@
>>   #include <linux/interrupt.h>
>>   #include <linux/iosys-map.h>
>>   #include <linux/spinlock_types.h>
>> +#include <linux/stackdepot.h>
>>   #include <linux/wait.h>
>>   #include <linux/xarray.h>
>>   
>> @@ -104,6 +105,18 @@ struct xe_dead_ct {
>>   	/** snapshot_log: copy of GuC log at point of error */
>>   	struct xe_guc_log_snapshot *snapshot_log;
>>   };
>> +
>> +/** struct xe_fast_req_fence - Used to track FAST_REQ messages by fence to match error responses */
>> +struct xe_fast_req_fence {
>> +	/** @fence: sequence number sent in H2G and return in G2H error */
>> +	u16 fence;
>> +	/** @action: H2G action code */
>> +	u16 action;
>> +#if IS_ENABLED(CONFIG_DRM_XE_DEBUG_GUC)
>> +	/** @stack: call stack from when the H2G was sent */
>> +	depot_stack_handle_t stack;
>> +#endif
>> +};
>>   #endif
>>   
>>   /**
>> @@ -152,6 +165,8 @@ struct xe_guc_ct {
>>   #if IS_ENABLED(CONFIG_DRM_XE_DEBUG)
>>   	/** @dead: information for debugging dead CTs */
>>   	struct xe_dead_ct dead;
>> +	/** @fast_req: history of FAST_REQ messages for matching with G2H error responses*/
> no trailing space before */
I'm surprised checkpatch doesn't check for that.

John.

>
>> +	struct xe_fast_req_fence fast_req[SZ_32];
>>   #endif
>>   };
>>

next prev parent reply	other threads:[~2025-05-12 19:53 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-08  1:34 [PATCH v3 0/4] Track FAST_REQ H2Gs to report where errors came from John.C.Harrison
2025-05-08  1:34 ` [PATCH v3 1/4] drm/xe/guc: Remove double blank line John.C.Harrison
2025-05-08 19:37   ` Michal Wajdeczko
2025-05-08  1:34 ` [PATCH v3 2/4] drm/xe/guc: Add missing H2G error code definitions John.C.Harrison
2025-05-08 19:50   ` Michal Wajdeczko
2025-05-12 19:30     ` John Harrison
2025-05-08  1:34 ` [PATCH v3 3/4] drm/xe/guc: Rename CONFIG_XE_LARGE_GUC_BUFFER John.C.Harrison
2025-05-08 19:52   ` Michal Wajdeczko
2025-05-08  1:34 ` [PATCH v3 4/4] drm/xe/guc: Track FAST_REQ H2Gs to report where errors came from John.C.Harrison
2025-05-08 20:11   ` Michal Wajdeczko
2025-05-12 19:52     ` John Harrison [this message]
2025-05-08  1:40 ` ✓ CI.Patch_applied: success for Track FAST_REQ H2Gs to report where errors came from (rev2) Patchwork
2025-05-08  1:40 ` ✗ CI.checkpatch: warning " Patchwork
2025-05-08  1:41 ` ✓ CI.KUnit: success " Patchwork
2025-05-08  1:50 ` ✓ CI.Build: " Patchwork
2025-05-08  1:52 ` ✓ CI.Hooks: " Patchwork
2025-05-08  1:53 ` ✓ CI.checksparse: " Patchwork
2025-05-08  2:15 ` ✓ Xe.CI.BAT: " Patchwork
2025-05-08 22:58 ` ✗ Xe.CI.Full: failure " Patchwork
2025-05-26 18:39 ` ✗ CI.Patch_applied: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5c7bacb4-d1ba-4980-92df-085c4f405762@intel.com \
    --to=john.c.harrison@intel.com \
    --cc=Intel-Xe@Lists.FreeDesktop.Org \
    --cc=daniele.ceraolospurio@intel.com \
    --cc=michal.wajdeczko@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox